You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/01/14 01:02:33 UTC

[GitHub] [drill] paul-rogers commented on pull request #2139: DRILL-6268: Drill-on-YARN client obtains HDFS URL incorrectly

paul-rogers commented on pull request #2139:
URL: https://github.com/apache/drill/pull/2139#issuecomment-759855911


   @vdiravka, thank you for taking a look at this issue. It has been a while since I wrote this stuff. I've poked around a bit to refresh my memory of what we were trying to do.
   
   The goal is to have a config system that draws from both the HDFS config and from the DoY config files. When this was written, Drill did not use the HDFS config files; the config info was stored by Drill in ZK. Also, during testing, we might target any number of HDFS systems, so we had to support more than just the local HDFS config file.
   
   Those considerations led us to use a combination of the Drill-style config file and the HDFS config. Note that the DoY config is *not* the same as the Drill config (though both use the same HOCON library.) There is a chicken-and-egg problem: for the DoY client, Drill is a zip file; there is no actual Drill installed on the machine acting as the DoY client. Also, a single DoY client can run any number of Drill instances. Hence, DoY does not read a Drill config file; the DoY config is a separate entity.
   
   With that background, let's look at the [config file](https://github.com/apache/drill/blob/master/drill-yarn/src/main/resources/org/apache/drill/yarn/core/drill-on-yarn-defaults.conf):
   
   ```
   drill.yarn: {
     app-name: "Drill-on-YARN"
   
     # Settings here support a default single-node cluster on the local host,
     # using the default HDFS connection obtained from the Hadoop config files.
   
     dfs: {
       connection: ""
       app-dir: "/user/drill"
     }
     ...
   ```
   
   Note that `drill.yarn.dfs.connection` is supposed to the the target DFS connection. So, if you set that to `hdfs://localhost:9000`, things should work.
   
   You noted that we obtain the information twice. The code in [DfsFacade.connect()](https://github.com/apache/drill/blob/master/drill-yarn/src/main/java/org/apache/drill/yarn/core/DfsFacade.java) appears correct:
   
   ```
     public void connect() throws DfsFacadeException {
       loadYarnConfig();
       String dfsConnection = config.getString(DrillOnYarnConfig.DFS_CONNECTION);
       try {
         if (DoYUtil.isBlank(dfsConnection)) {
           fs = FileSystem.get(yarnConf);
         } else {
           URI uri;
           try {
             uri = new URI(dfsConnection);
           } catch (URISyntaxException e) {
             throw new DfsFacadeException(
                 "Illformed DFS connection: " + dfsConnection, e);
           }
           fs = FileSystem.get(uri, yarnConf);
         }
       } catch (IOException e) {
         throw new DfsFacadeException("Failed to create the DFS", e);
       }
     }
   ```
   
   We first use the value from the DoY config. If not set, we fall back to the Hadoop config. This behavior gives us what we want: DoY config first, else fall back to the Hadoop config.
   
   The initial note also mentions `loadYarnConfig()`. This method loads, well, the YARN config. The original thought, IIRC, is that the YARN config points to the YARN services; not the HDFS services. Also, the idea was that DoY works with a single YARN instance, identified by its YARN config. (Now, in reality, I suppose that there must be only one HDFS per YARN.) So, it should not be the case that the YARN config breaks HDFS.
   
   Is it the case, in this instance, that the HDFS server is configured in YARN, but not in the HDFS config files?
   
   At this point, I suspect I've exhausted my YARN knowledge. I can, however, offer a suggestion.
   
   If the YARN config holds (or reads) the HDFS config more accurately than the default HDFS config, then change the `connect()` method above to use the (cached) YARN config.
   
   Before doing this, I'd recommend researching these issues a bit more in the YARN and HDFS configs. Maybe that fs URI check in `loadYarnConfig()` is wrong.
   
   It may also be that, to use YARN, you must have a valid YARN config and a valid HDFS config to go with it, so we might not need the DoY connect config. (I suspect this config was added, in part, because MapRFS didn't use the HDFS configs, but I could be wrong.)
   
   So, perhaps do a bit more homework, then we can refine the fix.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org