Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/01/07 02:23:05 UTC

[GitHub] [drill] cgivre opened a new pull request #2139: DRILL-6268: Drill-on-YARN client obtains HDFS URL incorrectly

cgivre opened a new pull request #2139:
URL: https://github.com/apache/drill/pull/2139


   # [DRILL-6268](https://issues.apache.org/jira/browse/DRILL-6268): Drill-on-YARN client obtains HDFS URL Incorrectly
   
   ## Description
   
   The Drill-on-YARN client must upload files to HDFS so that YARN can localize them. The code that does so is in `DfsFacade`. This code obtains the URL twice. The first time is correct:
    
   ```java
   private void loadYarnConfig() {
     ...
     URI fsUri = FileSystem.getDefaultUri(yarnConf);
     if (fsUri.toString().startsWith("file:/")) {
       System.err.println("Warning: Default DFS URI is for a local file system: " + fsUri);
     }
   }
   ```
   The `fsUri` returned is `hdfs://localhost:9000`, which is the correct value for an out-of-the-box Hadoop 2.9.0 install after following these instructions. The instructions have the reader explicitly set the port number to 9000:
   ```xml
   <configuration>
       <property>
           <name>fs.defaultFS</name>
           <value>hdfs://localhost:9000</value>
       </property>
   </configuration>
   ```
   The other place that gets the URL, this time for real, is `DfsFacade.connect()`:
   ```java
       String dfsConnection = config.getString(DrillOnYarnConfig.DFS_CONNECTION);
   ```
   This value comes back as `hdfs://localhost/`, which causes HDFS to try to connect on port 8020 (the Hadoop default), resulting in the following error:
   ```
   Connecting to DFS... Connected.
   Uploading /Users/paulrogers/bin/apache-drill-1.13.0.tar.gz to /users/drill/apache-drill-1.13.0.tar.gz ... Failed.
   Failed to upload Drill archive
     Caused by: Failed to create DFS directory: /users/drill
     Caused by: Call From Pauls-MBP/192.168.1.243 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused;
   ```
   
   For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
   (A shout-out here to arjun kr for suggesting we include the extra exception details; very helpful.)
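   Why port 8020? A URI like `hdfs://localhost/` carries no explicit port, so `java.net.URI` reports `-1`, and the HDFS client then substitutes its built-in default, which is 8020 for the Hadoop 2.x NameNode RPC. A minimal sketch of that fallback behavior (plain JDK, no Hadoop dependency; the constant is hard-coded here purely for illustration):

   ```java
   import java.net.URI;

   public class PortCheck {
     // Hadoop 2.x NameNode RPC default port, hard-coded for illustration only.
     static final int HDFS_DEFAULT_PORT = 8020;

     // Mimics the client-side fallback: use the URI's explicit port if present,
     // otherwise fall back to the scheme's default.
     static int effectivePort(String url) {
       int port = URI.create(url).getPort(); // -1 when the URI has no explicit port
       return port == -1 ? HDFS_DEFAULT_PORT : port;
     }

     public static void main(String[] args) {
       System.out.println(effectivePort("hdfs://localhost/"));     // 8020
       System.out.println(effectivePort("hdfs://localhost:9000")); // 9000
     }
   }
   ```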
   
   The workaround is to manually change the port to 8020 in the config setting shown above.
   The full fix is to change the code to use the following line in `connect()`:
   ```java
    String dfsConnection = FileSystem.getDefaultUri(yarnConf).toString();
   ```
   This bug is serious because it constrains the ability of users to select non-default HDFS ports.
   
   ## Documentation
   No user-facing changes.
   
   ## Testing
   Unit tests pass. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [drill] paul-rogers commented on pull request #2139: DRILL-6268: Drill-on-YARN client obtains HDFS URL incorrectly

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2139:
URL: https://github.com/apache/drill/pull/2139#issuecomment-812928381


   @cgivre, revisiting this one. The fix is probably not correct. As explained earlier, the goal is to (1) use the YARN config by default, unless (2) it is overridden in the DoY config file. Here is the default config from `drill-on-yarn-defaults.conf`:
   
   ```
   drill.yarn: {
     ...
     dfs: {
       connection: ""
       app-dir: "/user/drill"
      }
    }
    ```
   
   The code says:
   
   ```java
       String dfsConnection = config.getString(DrillOnYarnConfig.DFS_CONNECTION);
       try {
         if (DoYUtil.isBlank(dfsConnection)) {
           fs = FileSystem.get(yarnConf);
   ```
   
   So, if the `dfs.connection` property is blank, use the one from the YARN config file.
   
   Again, why might there be a different DoY value? Because some users push apps to multiple servers, and the DoY config alone should be sufficient to do so, without requiring multiple different YARN configs. (If, in practice, people use only one config, we can remove these DoY settings if not needed. But let's assume they are needed.)
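   To make the intended precedence concrete, here is an illustrative helper (not the actual DoY code; the blank-means-unset convention mirrors the `DoYUtil.isBlank` check quoted above):

   ```java
   public class ConnectionChoice {
     // Sketch of the intended rule: a non-blank drill.yarn.dfs.connection wins;
     // otherwise fall back to the fs.defaultFS value from the Hadoop/YARN config.
     static String chooseConnection(String doyConnection, String hadoopDefaultFs) {
       if (doyConnection == null || doyConnection.trim().isEmpty()) {
         return hadoopDefaultFs;
       }
       return doyConnection;
     }

     public static void main(String[] args) {
       System.out.println(chooseConnection("", "hdfs://localhost:9000"));
       System.out.println(chooseConnection("hdfs://nn1:8020", "hdfs://localhost:9000"));
     }
   }
   ```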
   
   So, the question is, why did the user see the bug which was reported? Where did the `"hdfs://localhost/"` value come from? **That** is the bug we need to fix.
   
   The answer seems to be that someone used `drill-on-yarn-example.conf` as their config, without checking whether the *example* values fit their setup. (This is an *example*, not a *default*.):
   
   ```
   drill.yarn: {
     ...
     dfs: {
       # Connection to the distributed file system. Defaults to work with
       # a single-node Drill on the local machine.
       # Omit this if you want to get the configuration either from the
       # Hadoop config (set with config-dir above) or from the
       # $DRILL_HOME/core-site.xml.
   
        connection: "hdfs://localhost/"
      }
    }
    ```
   
   Why is that being used? The proper "default" file is `drill-on-yarn-override.conf` from `distribution`. But, it looks like the `component.xml` file is missing a line. So, maybe the user renamed the example file to `drill-on-yarn-override.conf`. We need:
   
   ```xml
       <file>
         <source>src/main/resources/drill-on-yarn-override.conf</source>
         <outputDirectory>conf</outputDirectory>
         <fileMode>0640</fileMode>
       </file>
   ```
   
   With the above Maven fix, we don't need to change the code: the code does what it is supposed to do, if given a proper (blank) config entry.
   
   An "extra for experts" fix is to add the updated port number to the example file above.
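   For that "extra for experts" change, the example file's connection line could spell out the port explicitly (9000 here matches the stock single-node Hadoop instructions quoted in the PR description; adjust to the actual NameNode port):

   ```
   # An explicit port avoids silently falling back to the Hadoop default (8020):
   connection: "hdfs://localhost:9000"
   ```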





[GitHub] [drill] cgivre commented on pull request #2139: DRILL-6268: Drill-on-YARN client obtains HDFS URL incorrectly

Posted by GitBox <gi...@apache.org>.
cgivre commented on pull request #2139:
URL: https://github.com/apache/drill/pull/2139#issuecomment-846099484


   @paul-rogers 
   If I'm understanding you correctly, it sounds like the correct edits for this PR are that I need to:
   1. Modify `component.xml` as noted above.
   2. Add some documentation to explain how Drill gets the config info.
   
   Does that seem correct?





[GitHub] [drill] paul-rogers commented on pull request #2139: DRILL-6268: Drill-on-YARN client obtains HDFS URL incorrectly

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on pull request #2139:
URL: https://github.com/apache/drill/pull/2139#issuecomment-759855911


   @vdiravka, thank you for taking a look at this issue. It has been a while since I wrote this stuff. I've poked around a bit to refresh my memory of what we were trying to do.
   
   The goal is to have a config system that draws from both the HDFS config and from the DoY config files. When this was written, Drill did not use the HDFS config files; the config info was stored by Drill in ZK. Also, during testing, we might target any number of HDFS systems, so we had to support more than just the local HDFS config file.
   
   Those considerations led us to use a combination of the Drill-style config file and the HDFS config. Note that the DoY config is *not* the same as the Drill config (though both use the same HOCON library.) There is a chicken-and-egg problem: for the DoY client, Drill is a zip file; there is no actual Drill installed on the machine acting as the DoY client. Also, a single DoY client can run any number of Drill instances. Hence, DoY does not read a Drill config file; the DoY config is a separate entity.
   
   With that background, let's look at the [config file](https://github.com/apache/drill/blob/master/drill-yarn/src/main/resources/org/apache/drill/yarn/core/drill-on-yarn-defaults.conf):
   
   ```
   drill.yarn: {
     app-name: "Drill-on-YARN"
   
     # Settings here support a default single-node cluster on the local host,
     # using the default HDFS connection obtained from the Hadoop config files.
   
     dfs: {
       connection: ""
       app-dir: "/user/drill"
     }
      ...
    }
    ```
   
   Note that `drill.yarn.dfs.connection` is supposed to be the target DFS connection. So, if you set that to `hdfs://localhost:9000`, things should work.
   
   You noted that we obtain the information twice. The code in [DfsFacade.connect()](https://github.com/apache/drill/blob/master/drill-yarn/src/main/java/org/apache/drill/yarn/core/DfsFacade.java) appears correct:
   
    ```java
     public void connect() throws DfsFacadeException {
       loadYarnConfig();
       String dfsConnection = config.getString(DrillOnYarnConfig.DFS_CONNECTION);
       try {
         if (DoYUtil.isBlank(dfsConnection)) {
           fs = FileSystem.get(yarnConf);
         } else {
           URI uri;
           try {
             uri = new URI(dfsConnection);
           } catch (URISyntaxException e) {
             throw new DfsFacadeException(
                 "Illformed DFS connection: " + dfsConnection, e);
           }
           fs = FileSystem.get(uri, yarnConf);
         }
       } catch (IOException e) {
         throw new DfsFacadeException("Failed to create the DFS", e);
       }
     }
   ```
   
   We first use the value from the DoY config. If not set, we fall back to the Hadoop config. This behavior gives us what we want: DoY config first, else fall back to the Hadoop config.
   
   The initial note also mentions `loadYarnConfig()`. This method loads, well, the YARN config. The original thought, IIRC, was that the YARN config points to the YARN services, not the HDFS services. Also, the idea was that DoY works with a single YARN instance, identified by its YARN config. (Though, in reality, I suppose there must be only one HDFS per YARN.) So, it should not be the case that the YARN config breaks HDFS.
   
   Is it the case, in this instance, that the HDFS server is configured in YARN, but not in the HDFS config files?
   
   At this point, I suspect I've exhausted my YARN knowledge. I can, however, offer a suggestion.
   
   If the YARN config holds (or reads) the HDFS config more accurately than the default HDFS config, then change the `connect()` method above to use the (cached) YARN config.
   
   Before doing this, I'd recommend researching these issues a bit more in the YARN and HDFS configs. Maybe that fs URI check in `loadYarnConfig()` is wrong.
   
   It may also be that, to use YARN, you must have a valid YARN config and a valid HDFS config to go with it, so we might not need the DoY connect config. (I suspect this config was added, in part, because MapRFS didn't use the HDFS configs, but I could be wrong.)
   
   So, perhaps do a bit more homework, then we can refine the fix.





[GitHub] [drill] vdiravka commented on pull request #2139: DRILL-6268: Drill-on-YARN client obtains HDFS URL incorrectly

Posted by GitBox <gi...@apache.org>.
vdiravka commented on pull request #2139:
URL: https://github.com/apache/drill/pull/2139#issuecomment-759816691


   @paul-rogers Could you please take a look? I am going to check it on the cluster. Are there any specific cases to cover? Should non-default Hadoop ports, a possibly modified YARN config file, or something else be checked?

