Posted to common-user@hadoop.apache.org by Jimmy Wan <ji...@indeed.com> on 2008/11/11 19:39:11 UTC

Recommendations on Job Status and Dependency Management

I'd like to take my prototype batch processing of hadoop jobs and implement
some type of "real" dependency management and scheduling in order to better
utilize my cluster as well as spread out more work over time. I was thinking
of adopting one of the existing packages (Cascading, Zookeeper, existing
JobControl?) and I was hoping to find some better advice from the mailing
list. I tried to find a more direct comparison of Cascading and Zookeeper but
I couldn't find one.

This is a grossly simplified description of my current, completely naive
approach:

1) for each day in a month, spawn N threads that each contain a dependent
series of map/reduce jobs.

2) for each day in a month, spawn N threads that each contain a dependent
series of map/reduce jobs that are dependent on the output of step #1. These
are currently separated from the tasks in step #1 mainly because it's easier
to group them up this way in the event of a failure, but I expect this
separation to go away.

3) At the end of the month, serially run a series of jobs outside of
Map/Reduce that basically consist of a single SQL query (I could easily
convert these to be very simple map/reduce jobs, and probably will, if it
makes my job processing easier).

The main problems I have are the following:
1) right now I have a hard time determining which processes need to be run
in the event of a failure.

Every job has an expected input/output in HDFS, so if I have to rerun
something I usually just use something like "hadoop dfs -rmr <path>" in a
shell script, then hand-edit the jobs that need to be rerun.

Is there an example somewhere of code that can read HDFS in order to
determine if files exist? I poked around a bit and couldn't find one.
Ideally, my code would be able to read the HDFS config info right out of the
standard config files so I wouldn't need to create additional configuration
information.

The job dependencies, while well enumerated, are not all that well isolated.
Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just
that one process and any dependent processes, but not have to rerun
everything again.

2) I typically run everything 1 month at a time, but I want to keep the
option of doing rollups by day. On the 2nd of the month, I'd like to be able
to run anything that requires data from the 1st of the month. On the 1st of
the month, I'd like to run anything that requires a full month of data from
the previous month.

I'd also like my process to be able to account for system failures on
previous days; i.e., on any given day I'd like to be able to run everything
for which data is available.

3) Certain types of jobs have external dependencies (e.g., MySQL) and I don't
want to run too many of those jobs at the same time since it affects my
MySQL performance. I'd like some way of describing a lock on external
resources that can be shared across jobs.
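
To make #3 concrete: within my single driver JVM, something as coarse as a
shared semaphore around the MySQL-backed jobs would express the idea. This is
only a hypothetical sketch; the class name, method name, and the limit of 2
are arbitrary:

        import java.util.concurrent.Semaphore;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;

        public class MySqlJobGate {
            // Allow at most 2 MySQL-touching jobs at once (the limit is arbitrary).
            private static final Semaphore MYSQL_SLOTS = new Semaphore(2);

            public static void runMySqlBackedJob(JobConf conf) throws Exception {
                MYSQL_SLOTS.acquire();          // block until a slot is free
                try {
                    JobClient.runJob(conf);     // the job that hits MySQL
                } finally {
                    MYSQL_SLOTS.release();      // always give the slot back
                }
            }
        }

What I don't have is a way to express that kind of constraint across jobs in
a scheduling framework, which is what I'm really after.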

Any recommendations on how to best model these things?

I'm thinking that something like Cascading or Zookeeper could help me here.
My initial take was that Zookeeper was more heavyweight than Cascading,
requiring additional processes to be running at all times. However, it seems
like Zookeeper would be better suited to describing mutual exclusions on
usage of external resources. Can Cascading even do this?

I'd also appreciate any recommendations on how best to tune the hadoop
processes. My hadoop 0.16.4 cluster is currently relatively small (<10 nodes)
so I'm thinking the 1GB defaults for my NameNode, DataNodes, and JobTracker
might be overkill. I also plan to upgrade to 0.17.* or 0.18.* at some point
in the near future.

Re: Recommendations on Job Status and Dependency Management

Posted by Jimmy Wan <ji...@indeed.com>.
Figured I should respond to my own question and list the solution for
the archives:

Since I already had a bunch of existing MapReduce jobs, I was able to quickly
migrate my code to Cascading to take care of all the dependencies between my
Hadoop jobs.

By making use of the MapReduceFlow and dumping those flows into a Cascade
with a CascadeConnector, I was able to throw out several hundred lines of
hand-created Thread and dependency management code in favor of an automated
solution that actually worked a wee bit better in terms of concurrency. I was
able to see an immediate increase in the utilization of my cluster.
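
For the archives, the skeleton of the wiring looks roughly like this. The flow
names and the two JobConfs are placeholders for my real jobs; check the
javadoc linked below for the exact constructors in your Cascading version:

        import org.apache.hadoop.mapred.JobConf;

        import cascading.cascade.Cascade;
        import cascading.cascade.CascadeConnector;
        import cascading.flow.Flow;
        import cascading.flow.MapReduceFlow;

        // dailyConf and monthlyConf are ordinary, fully configured JobConfs for
        // existing MapReduce jobs; the names here are placeholders.
        Flow daily = new MapReduceFlow("daily-rollup", dailyConf);
        Flow monthly = new MapReduceFlow("monthly-rollup", monthlyConf);

        // CascadeConnector works out the dependency graph from each flow's
        // source and sink paths, then runs the flows with as much parallelism
        // as the dependencies allow.
        Cascade cascade = new CascadeConnector().connect(daily, monthly);
        cascade.complete(); // blocks until every flow has finished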

I covered how I worked out the initial HDFS-dependencies in the other reply
to this message.

For determining whether the trigger conditions are met (they rely on outside
processes for which there is no easy way to read a signal), I'm currently
polling a database for that data, and I'm working with Chris to add a hook
into Cascade to allow pluggable predicates to specify the condition.
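
The polling itself is nothing fancy; here is a sketch of the idea with plain
JDBC. The table, column, and connection details (jdbcUrl, user, pass, day)
are made up for illustration:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;

        // Made-up schema: a row in upstream_status marks a day's external feed
        // as complete. Poll until it shows up, then let the cascade run.
        // (Assume this lives in a method declared to throw Exception.)
        boolean ready = false;
        while (!ready) {
            Connection conn = DriverManager.getConnection(jdbcUrl, user, pass);
            try {
                PreparedStatement stmt = conn.prepareStatement(
                    "SELECT complete FROM upstream_status WHERE day = ?");
                stmt.setString(1, day);
                ResultSet rs = stmt.executeQuery();
                ready = rs.next() && rs.getBoolean("complete");
            } finally {
                conn.close();
            }
            if (!ready) {
                Thread.sleep(60 * 1000L); // check again in a minute
            }
        }
        cascade.complete(); // now safe to kick off the Cascading flows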

So yeah, I'm sold on Cascading. =)

Relevant links:
http://www.cascading.org

Relevant API Docs
http://www.cascading.org/javadoc/cascading/flow/MapReduceFlow.html
http://www.cascading.org/javadoc/cascading/cascade/CascadeConnector.html
http://www.cascading.org/javadoc/cascading/cascade/Cascade.html

Re: Recommendations on Job Status and Dependency Management

Posted by Jimmy Wan <ji...@indeed.com>.
I was able to answer one of my own questions:

"Is there an example somewhere of code that can read HDFS in order to
determine if files exist? I poked around a bit and couldn't find one.
Ideally, my code would be able to read the HDFS config info right out of the
standard config files so I wouldn't need to create additional configuration
information."

The following code was all that I needed:
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(configuration);
        Path path = new Path(filename);
        boolean fileExists = fileSystem.exists(path);

At first, the code didn't work as I expected because my working shell scripts
that made use of "hadoop/bin/hadoop jar my.jar" did not explicitly include
HADOOP_CONF_DIR in my classpath. Once I did that, everything worked just
fine.
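
If for some reason you can't get HADOOP_CONF_DIR onto the classpath, my
understanding is that you can also point the Configuration at the site file
explicitly; the path below is just an example location:

        // Load the site config explicitly instead of relying on the classpath.
        Configuration configuration = new Configuration();
        configuration.addResource(new Path("/opt/hadoop/conf/hadoop-site.xml"));
        FileSystem fileSystem = FileSystem.get(configuration);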
