Posted to commits@oozie.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2011/09/08 07:19:10 UTC

[jira] [Created] (OOZIE-348) GH-561: Redesign oozie internal queue

GH-561: Redesign oozie internal queue
-------------------------------------

                 Key: OOZIE-348
                 URL: https://issues.apache.org/jira/browse/OOZIE-348
             Project: Oozie
          Issue Type: Bug
            Reporter: Hadoop QA


We have had many issues with the Oozie internal queue, including queue overflow and the re-queuing of the same heavily used commands to avoid starvation. There are other problem cases too. The problem becomes very obvious under very high load.


I would like to open up a discussion to find a better long-term architectural design that takes very high load into account.

The following proposals are meant to start the discussion. They range from a complete overhaul to adjustments of the current design:

1. Implement the queue in the DB:
   Pros: Persistence. Useful in a hot-hot or load-balancing setup. Single source of truth. Different levels of ordering can be applied as needed through SQL. No need to worry about queue size. No need to recreate the queue on every restart -- the recovery service might be less busy.

  Cons: Extra DB access overhead.

  A middle approach could be to keep an in-memory cache with strict conditions. The details can be discussed later.

2. Re-queuing the same commands (which is used for throttling) should be redesigned. Make sure the command is re-queued in the *same* place -- not at the end of the queue. I know this breaks the meaning of a queue, so we might need to use a different data structure.

Currently, re-queuing the same command at the end creates a starvation-like (live-lock) situation.

3. Multiple queues. One for the coordinator input check, which is used 99% of the time.

Comments?

Regards,
Mohammad

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hadoop QA resolved OOZIE-348.
-----------------------------

    Resolution: Fixed


[jira] [Commented] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100018#comment-13100018 ] 

Hadoop QA commented on OOZIE-348:
---------------------------------

tucu00 remarked:
A few points:

* Using the DB as queue storage would be overkill for the DB.

* Queue overflow is already handled by Oozie, as commands are regenerated from the DB state.

* Multiple queues would complicate the system. The current mechanism already handles priorities, buckets (concurrency control for a given command type), and anti-starvation.

My take is that we should fix unique command queuing; that will solve most, if not all, of the issues.


[jira] [Commented] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101905#comment-13101905 ] 

Hadoop QA commented on OOZIE-348:
---------------------------------

mislam77 remarked:
How could we ensure the re-queuing will not disturb the ordering?


[jira] [Reopened] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Roman Shaposhnik (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Shaposhnik reopened OOZIE-348:
------------------------------------



[jira] [Commented] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101903#comment-13101903 ] 

Hadoop QA commented on OOZIE-348:
---------------------------------

mislam77 remarked:
So there will be 2 queues: one for coordinator input checks (queue 1) and another for the rest of the commands (queue 2).
In this approach, the questions are:
* Will there be 2 threadpools? I assume there will be.

* For queue 2, re-queuing will still happen, right? Although we don't see any problem with the other commands at this point, do you think a similar situation could happen later? Since re-queuing perturbs the original ordering, queue processing will be unfair. Considering this, shouldn't we look for another approach?

* How does the threadpool size impact the system? I ask because we would like to increase the default threadpool size from 120.

Can we discuss the other approach too, i.e. using a queue in the DB?

If we want to implement a hot-hot or load-balancing system (a possible future direction), I think the DB approach will help.
In the current approach, the same queue would be created on both systems (although they might not both process the same command), resulting in the unnecessary overhead of keeping the same element in both queues.


[jira] [Commented] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101904#comment-13101904 ] 

Hadoop QA commented on OOZIE-348:
---------------------------------

tucu00 remarked:
It seems to me that the re-queuing logic is not correct. It should not alter the order; it should just ignore the duplicate queuing and leave the original entry in its existing place in the queue.

A threadpool size of 120 is a bit too high for a default value. That should be a site configuration value. The optimum size of the threadpool depends on the load of your system and the hardware/OS resources you have.

IMO, a database would be overkill. I would not replace the existing in-memory solution with a DB solution; rather, I'd leverage the fact that services are pluggable and have a DB solution as well. Still, I'd suggest you test your current load with a DB solution.

Regarding the comment that the DB approach would be good for a hot-hot solution: load distribution for an in-memory solution could easily be handled by having each instance handle the IDs that satisfy JOBID MOD ${LIVE_OOZIE_INSTANCES} == ${OOZIE_INSTANCE_ID}. The number of live instances and the instance ID would be dynamically generated/stored in Zookeeper (which would be needed to provide distributed lock support).
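
The modulo rule can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the job ID is assumed to be reducible to a numeric sequence, and the live-instance count and instance ID are passed in as plain parameters rather than read from Zookeeper.

```java
/** Hypothetical helper; not part of Oozie. */
final class InstancePartition {
    private InstancePartition() {}

    /**
     * Returns true if the instance with the given ID should handle the job,
     * i.e. jobSeq MOD liveInstances == instanceId.
     * jobSeq is an assumed numeric form of the job ID.
     */
    static boolean owns(long jobSeq, int liveInstances, int instanceId) {
        if (liveInstances <= 0 || instanceId < 0 || instanceId >= liveInstances) {
            throw new IllegalArgumentException("invalid instance configuration");
        }
        // floorMod keeps the result in [0, liveInstances) even for negative input.
        return Math.floorMod(jobSeq, (long) liveInstances) == instanceId;
    }
}
```

With 3 live instances, job sequence 10 would be handled by instance 1 only, so every job has exactly one owner as long as all instances agree on the instance count.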


[jira] [Commented] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101901#comment-13101901 ] 

Hadoop QA commented on OOZIE-348:
---------------------------------

mislam77 remarked:
Queue uniqueness is already implemented. It certainly reduces the occurrence of the problem.
However, it didn't eliminate the problem as you mentioned.

As part of concurrency control, we re-queue the same command with a 500ms delay at the head of the queue. In a heavily loaded system, the same command can be re-queued again and again, causing a livelock-like situation.
Consider an example where there are nearly 10K unique coordinator input checks
and the maximum concurrency is 40. After the first 40, all of the others get re-queued until one command is done. This type of situation continues for some time.

A similar situation has caused big trouble in production.
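
To put a number on the example, here is a toy model (not Oozie code; the class and method names are made up, and the 500ms delay is ignored). It makes one pass over the queue while no running command finishes and counts how many commands are pulled off only to be put back.

```java
import java.util.ArrayDeque;

/** Toy model of concurrency-limited re-queuing; hypothetical, for illustration only. */
final class RequeueChurn {
    private RequeueChurn() {}

    /** Counts re-queue operations in one pass, assuming no running command completes. */
    static long requeuesInOnePass(int queued, int maxConcurrency) {
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        for (int i = 0; i < queued; i++) {
            queue.add(i);
        }
        int running = 0;
        long requeues = 0;
        for (int i = 0; i < queued; i++) {
            int command = queue.poll();
            if (running < maxConcurrency) {
                running++;          // a free slot: the command starts executing
            } else {
                queue.add(command); // over the limit: the command goes back on the queue
                requeues++;
            }
        }
        return requeues;
    }
}
```

In this model, `requeuesInOnePass(10000, 40)` returns 9960: every pass does almost 10K queue operations that make no progress until one of the 40 running commands finishes.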


[jira] [Commented] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101902#comment-13101902 ] 

Hadoop QA commented on OOZIE-348:
---------------------------------

tucu00 remarked:
Well, then the solution would be to use a separate queue service exclusively for coordinator input checks. In that case the threadpool would be the only throttling and no concurrency re-queuing would happen.
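
A rough sketch of that idea, using only the standard java.util.concurrent classes. The `InputCheckPool` name is hypothetical and this is not the existing Oozie queue service; it only shows a fixed-size pool acting as the sole throttle, with waiting checks kept in FIFO order by the pool's own queue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical dedicated pool for coordinator input checks. */
final class InputCheckPool {
    private final ExecutorService pool;

    InputCheckPool(int threads) {
        // At most 'threads' checks run at once; the rest wait in the
        // executor's internal FIFO queue and are never re-queued.
        this.pool = Executors.newFixedThreadPool(threads);
    }

    void submit(Runnable inputCheck) {
        pool.execute(inputCheck);
    }

    /** Stops accepting work and waits for queued checks; false on timeout or interrupt. */
    boolean shutdownAndWait(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The pool size here plays the role of the maximum concurrency, so it would presumably come from site configuration rather than a hard-coded default.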


[jira] [Commented] (OOZIE-348) GH-561: Redesign oozie internal queue

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/OOZIE-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101906#comment-13101906 ] 

Hadoop QA commented on OOZIE-348:
---------------------------------

tucu00 remarked:
You'd have a UniqueQueue implementation that has a SET of element IDs.

The add/offer methods of the UniqueQueue first check whether the element is in the ID set. If it is, the add/offer is a NOP; if it is not, it adds the element to the queue and to the ID set. The poll/take/remove methods have to remove the element from the ID set as well. All this has to be done with the proper level of synchronization/locking to avoid race conditions.
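
A minimal sketch of such a UniqueQueue, under some assumptions: the ID is extracted by a caller-supplied function (hypothetical), the backing queue is a plain FIFO without priorities or delays, and coarse `synchronized` methods stand in for whatever locking the real service would need.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

/** Sketch of a FIFO queue that ignores elements whose ID is already queued. */
final class UniqueQueue<E> {
    private final Queue<E> queue = new ArrayDeque<>();
    private final Set<String> ids = new HashSet<>();
    private final Function<E, String> idOf; // hypothetical ID extractor

    UniqueQueue(Function<E, String> idOf) {
        this.idOf = idOf;
    }

    /** Adds the element unless its ID is already queued; a duplicate is a NOP. */
    synchronized boolean offer(E element) {
        if (!ids.add(idOf.apply(element))) {
            return false; // the original entry keeps its place in the queue
        }
        queue.add(element);
        return true;
    }

    /** Removes the head and forgets its ID so it can be queued again later. */
    synchronized E poll() {
        E head = queue.poll();
        if (head != null) {
            ids.remove(idOf.apply(head));
        }
        return head;
    }

    synchronized int size() {
        return queue.size();
    }
}
```

Offering "a", "b", "a" leaves two elements with "a" still at the head, so a duplicate offer never disturbs the ordering.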
