Posted to dev@oozie.apache.org by "Jacob Tolar (Jira)" <ji...@apache.org> on 2022/08/19 17:35:00 UTC

[jira] [Created] (OOZIE-3668) Simplify setting oozie.launcher.mapreduce.job.hdfs-servers

Jacob Tolar created OOZIE-3668:
----------------------------------

             Summary: Simplify setting oozie.launcher.mapreduce.job.hdfs-servers
                 Key: OOZIE-3668
                 URL: https://issues.apache.org/jira/browse/OOZIE-3668
             Project: Oozie
          Issue Type: New Feature
            Reporter: Jacob Tolar


When running Oozie jobs that depend on cross-cluster HDFS paths, I am required to provide the parameter {{oozie.launcher.mapreduce.job.hdfs-servers}}.

This is a pain to manage when there are many data sources, or when the same coordinator/workflow is deployed to multiple clusters (e.g. staging, production) that have different cross-cluster data access requirements. We need to keep track of the datasets and the nameNode lists in two places.
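For example, today every workflow has to carry the full namenode list by hand. A minimal sketch of the manual configuration (the namenode addresses are made up):

{code:xml}
<!-- Set in the workflow/action configuration; namenode hosts are hypothetical -->
<property>
  <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
  <value>hdfs://nn-primary:8020,hdfs://nn-analytics:8020</value>
</property>
{code}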

It's especially obnoxious if you are using something like an HCatalog table with partitions registered on a different HDFS. In that case, you can define your dataset and Oozie's coordinator takes care of all the details no matter where the partitions are stored, but the workflow will fail unless you inspect the table and add the correct name nodes to the hdfs-servers setting.
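To illustrate, a dataset definition like the following (the metastore host, database, and table are hypothetical) is all the coordinator needs to resolve dependencies, yet the workflow still fails until you look up where the partitions actually live and list those namenodes in hdfs-servers:

{code:xml}
<!-- Hypothetical HCatalog dataset; its partitions may live on any HDFS -->
<dataset name="events" frequency="${coord:days(1)}"
         initial-instance="2022-01-01T00:00Z" timezone="UTC">
  <uri-template>hcat://metastore-host:9083/mydb/events/dt=${YEAR}${MONTH}${DAY}</uri-template>
</dataset>
{code}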

If you are using Oozie coordinators with data dependencies to schedule jobs, Oozie already has access to all the information required to provide this setting automatically, which would help eliminate errors when the setting is missing or set incorrectly.

I think there are two reasonable approaches, both of which should be feasible. They're not necessarily mutually exclusive, but I would be happy with either one:

1. Oozie sets the value automatically

In this case, Oozie coordinator execution is updated to compute the list of hdfs-servers and pass it through to the workflow via the configuration. The Oozie workflow execution is updated to use the value provided by the coordinator as the default value for {{oozie.launcher.mapreduce.job.hdfs-servers}} if the setting is not provided.

The user should still be able to override the setting if needed. It would be helpful if there were a way for the user to specify *additional* hdfs-servers (i.e. specify {{oozie.launcher.mapreduce.job.hdfs-servers=${oozie.coord.hdfs-servers},hdfs://name-node}}: everything computed by the coordinator plus something else), but that may be an uncommon use case.
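In configuration form, that override might look like this (note that {{oozie.coord.hdfs-servers}} is a hypothetical property name that would be populated by the coordinator under this proposal):

{code:xml}
<!-- Hypothetical: ${oozie.coord.hdfs-servers} would be computed by the coordinator -->
<property>
  <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
  <value>${oozie.coord.hdfs-servers},hdfs://extra-name-node:8020</value>
</property>
{code}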

2. Oozie provides EL functions for easily computing the {{hdfs-servers}} setting

In this case, Oozie could be updated to provide three new coordinator EL functions. The output could be passed through to the workflow and used as needed by the user; a usage sketch follows the list below.

1. {{coord:getAllDatasetHdfsServers()}}: Takes no parameters and outputs a string.

This function will iterate over all {{dataIn}} and {{dataOut}} datasets configured in the coordinator and construct a string suitable for passing to the workflow parameter {{oozie.launcher.mapreduce.job.hdfs-servers}}. It should work for all supported dataset types (e.g. HDFS, HCatalog).

2. {{coord:getDataInHdfsServers(String dataIn)}}: Takes one parameter and outputs a string. 

This function does the same thing as (1), but only for the specified {{dataIn}} dataset.

3. {{coord:getDataOutHdfsServers(String dataOut)}}: Takes one parameter and outputs a string. 

This function does the same thing as (1), but only for the specified {{dataOut}} dataset.
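As a usage sketch, assuming the functions land as proposed (none of them exist today; the dataset names and app path below are hypothetical), a coordinator could pass the computed list straight through to its workflow:

{code:xml}
<action>
  <workflow>
    <app-path>${workflowAppPath}</app-path>
    <configuration>
      <property>
        <name>oozie.launcher.mapreduce.job.hdfs-servers</name>
        <!-- or a single dataset via coord:getDataInHdfsServers('inputLogs')
             / coord:getDataOutHdfsServers('outputReport') -->
        <value>${coord:getAllDatasetHdfsServers()}</value>
      </property>
    </configuration>
  </workflow>
</action>
{code}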



