You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zeppelin.apache.org by GitBox <gi...@apache.org> on 2021/04/19 03:25:20 UTC

[GitHub] [zeppelin] zjffdu opened a new pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

zjffdu opened a new pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097


   ### What is this PR for?
   A few sentences describing the overall goals of the pull request's commits.
   First time? Check out the contributing guide - https://zeppelin.apache.org/contribution/contributions.html
   
   
   ### What type of PR is it?
   [Bug Fix | Improvement | Feature | Documentation | Hot Fix | Refactoring]
   
   ### Todos
   * [ ] - Task
   
   ### What is the Jira issue?
   * Open an issue on Jira https://issues.apache.org/jira/browse/ZEPPELIN/
   * Put link here, and add [ZEPPELIN-*Jira number*] in PR title, eg. [ZEPPELIN-533]
   
   ### How should this be tested?
   * Strongly recommended: add automated unit tests for any new or changed behavior
   * Outline any manual steps to test the PR here.
   
   ### Screenshots (if appropriate)
   
   ### Questions:
   * Does the licenses files need update?
   * Is there breaking changes for older versions?
   * Does this needs documentation?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-854380073


   CI is fixed now, ready for review @Reamer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r655220707



##########
File path: zeppelin-server/src/main/java/org/apache/zeppelin/socket/NotebookSocket.java
##########
@@ -67,7 +67,7 @@ public String getProtocol() {
   }
 
   public synchronized void send(String serializeMessage) throws IOException {
-    connection.getRemote().sendStringByFuture(serializeMessage);
+    connection.getRemote().sendString(serializeMessage);

Review comment:
       Fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu edited a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu edited a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-842054131


   @Reamer Actually `YarnRemoteInterpreterProcess` doesn't do the downloading, it just upload the conda to hdfs as yarn app resource, and yarn will download it from hdfs before starting yarn container. 
   There're 2 benefits of using yarn app resource:
   * Don't need to update conda archives to hdfs, just use the local file system to store the conda env. This would make the development much smooth, user use the local conda env to verify it in local environment and then move it to yarn environment in production environment.
   * We can leverage yarn's resource cache mechanism. That means the same conda env downloaded by yarn_app_1 can be reused by yarn_app_2. If we download it in `JupyterKernelInterpreter.java`, it may cause network congestion if many python interpreters runs at the same time. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-833645596


   > At the moment I don't know if `spark.archives` supports a download via HTTP. I will find out as soon as possible. If this is possible, `python.archive` should also do the download so that you don't have to pack python (conda) environments several times.
   
   I can confirm, that `spark.archives` supports HTTP as a source.
   The following Spark command was successfully executed in my K8s cluster.
   ```
   PYSPARK_PYTHON=./environment/bin/python ./bin/spark-submit --deploy-mode cluster --archives https://leandi.my_company.de/nexus/service/local/repositories/de.my_company.it.bes.3rdparty/content/de/my_company/conda/my_company-test/1.0/my_company-test-1.0.tar.gz#environment examples/src/main/python/pi.py
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] cuspymd commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
cuspymd commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r616816582



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -242,13 +244,15 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
 
     // Set the resources to localize
     this.stagingDir = new Path(fs.getHomeDirectory() + "/.zeppelinStaging", appId.toString());
+    LOGGER.info("Use staging directory: {}", this.stagingDir);
     Map<String, LocalResource> localResources = new HashMap<>();
 
     File interpreterZip = createInterpreterZip();
     Path srcPath = localFs.makeQualified(new Path(interpreterZip.toURI()));
     Path destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
     addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "zeppelin");
-    FileUtils.forceDelete(interpreterZip);
+    LOGGER.info("Add zeppelin archive: {}", destPath);
+    //FileUtils.forceDelete(interpreterZip);

Review comment:
       Does it mean that commenting without erasing is a temporary modification?

##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -266,6 +270,30 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
         addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "hive_conf");
       }
     }
+
+    String yarnDistArchives = launchContext.getProperties().getProperty("zeppelin.yarn.dist.archives");
+    if (org.apache.commons.lang3.StringUtils.isNotBlank(yarnDistArchives)) {

Review comment:
       In the file, `org.apache.commons.lang3.StringUtils` is used 7 times, and `org.apache.hadoop.util.StringUtils` is used 1 time. I think it would be okay to use `org.apache.commons.lang3.StringUtils` in short form.

##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -316,6 +344,25 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
     return amContainer;
   }
 
+  public URI resolveURI(String path) {

Review comment:
       public?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer edited a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer edited a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-827815927


   You are right, putting the Conda environment in a cloud storage will be the best option. Do you have any idea what possibilities for integration `spark.archives` supports? Local mounting via filesystem is not an option in Kubernetes. I am hoping for an HTTP endpoint, which is very flexible and should work for most users.
   YARN should also be fine with an HTTP endpoint, so that the Conda environment can be dynamically loaded by the Zeppelin interpreter when the Python or PySpark interpreter is started.
   I think it would be great if the Zeppelin interpreter, independent of YARN and K8s, load the conda environment via the same mechanism.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824603467


   > Updates of the image with a changed Python version and libraries are not really possible because you don't know which other notebooks might be broken afterwards.
   
   I mean allowing user to choose different image for their notes. Updating image won't be a frequent operation IIUC
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r616830363



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -316,6 +344,25 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
     return amContainer;
   }
 
+  public URI resolveURI(String path) {

Review comment:
       change it to private now




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer edited a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer edited a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-825479159


   > Although the image size may look a little big, the physical size or downloading size would be the same of using the current python interpreter image and then download the python environment via whatever approach.
   
   You are right, the final disk size would be the same. It is even possible that it is lower, because if the same Python interpreter is started twice, they both use the same image layer.
   But the image approach has big disadvantages in my eyes.
   1) If we but the python environment (conda environment) into the image, then the user must select the image. I don't currently know of any way to check whether the image is present, which makes error handling in K8s very difficult. I think the error handling inside the zeppelin interpreter is much easier.
   2) For pyspark you need to create two images. A Zeppelin interpreter image (with the Python environment, Spark and the Zeppelin interpreter) and a Spark executor image (with the Python environment and Spark).
   
   Do you know if it is possible to set a URL for `spark.archives`? If not, then there is no other way to put the specific Python environment into the image.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-842054131


   @Reamer Actually `YarnRemoteInterpreterProcess` doesn't do the downloading, it just upload the conda to hdfs as yarn app resource, and yarn will download it from hdfs before starting yarn container. 
   There're 2 benefits of using yarn app resource:
   * Don't need to update conda archives to hdfs, just use the local file system to store the conda env. This would make the development much smooth, user use the local conda env to verify it and then move it to yarn environment in production environment.
   * We can leverage yarn's resource cache mechanism. That means the same conda env downloaded by yarn_app_1 can be reused by yarn_app_2. If we download it in `JupyterKernelInterpreter.java`, it may cause network congestion if many python interpreters runs at the same time. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r652430602



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnLauncherUtil.java
##########
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zeppelin.interpreter.launcher;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.File;
+import java.net.URI;
+import java.net.URISyntaxException;
+
+public class YarnLauncherUtil {
+
+  private static final Logger LOGGER = LoggerFactory.getLogger(YarnLauncherUtil.class);
+
+  public static URI resolveURI(String path) {

Review comment:
       @Reamer You are right, `URI uri = new URI(path)` is sufficient, I have updated it. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824736245


   > I mean allowing user to choose different image for their notes. Customizing image won't be a frequent operation IIUC
   
   Maintaining different Zeppelin interpreter images is quite difficult. I would like to see the user (download and) activate a specific conda environment in his note.
   The main advantage is that the image of the Zeppelin interpreter would be much smaller.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-836766584


   > What do you think ?
   
   Can we move the Conda environment download from `YarnRemoteInterpreterProcess.java` to the actual Python interpreter. I think that would be `JupyterKernelInterpreter.java`. This would solve the download for all launchers.
   Does this approach work with YARN?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-854380073


   CI is fixed now, ready for review @Reamer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-843240046


   Hi @zjffdu,
   
   > We can leverage yarn's resource cache mechanism. That means the same conda env downloaded by yarn_app_1 can be reused by yarn_app_2. If we download it in JupyterKernelInterpreter.java, it may cause network congestion if many python interpreters runs at the same time.
   
   I was aware of this, but it seems that downloading dependencies several times is the way of `spark.archives`. It is clear that this is not optimal.
   
   > Don't need to update conda archives to hdfs, just use the local file system to store the conda env. This would make the development much smooth, user use the local conda env to verify it in local environment and then move it to yarn environment in production environment.
   
   By local, do you mean the local file system of the Zeppelin server? In my environment, the Zeppelin user does not have access to the local file system of the Zeppelin server. Therefore, I would prefer a remote endpoint that is under the control of the Zeppelin user.
   I understand your development approach and it sounds great, but I think this is not suitable for a production environment. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-825500280


   Thanks @Reamer `spark.archives` seems to be able to work both in yarn and k8s, let me try this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r655220571



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -259,13 +266,48 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
       FileUtils.forceDelete(flinkZip);
 
       String hiveConfDir = launchContext.getProperties().getProperty("HIVE_CONF_DIR");
-      if (!org.apache.commons.lang3.StringUtils.isBlank(hiveConfDir)) {
+      if (!StringUtils.isBlank(hiveConfDir)) {
         File hiveConfZipFile = createHiveConfZip(new File(hiveConfDir));
         srcPath = localFs.makeQualified(new Path(hiveConfZipFile.toURI()));
         destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
         addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "hive_conf");
       }
     }
+
+    String yarnDistArchives = launchContext.getProperties().getProperty("zeppelin.yarn.dist.archives");
+    if (StringUtils.isNotBlank(yarnDistArchives)) {
+      for (String distArchive : yarnDistArchives.split(",")) {
+        URI distArchiveURI = null;
+        try {
+          distArchiveURI = new URI(distArchive);
+        } catch (URISyntaxException e) {
+          throw new IOException("Invalid uri: " + distArchive, e);
+        }
+        if (distArchiveURI.getScheme() == null || "file".equals(distArchiveURI.getScheme())) {
+          // zeppelin.yarn.dist.archives is local file
+          srcPath = localFs.makeQualified(new Path(distArchiveURI));
+          destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
+        } else {
+          // zeppelin.yarn.dist.archives is files on any hadoop compatible file system
+          destPath = new Path(removeLink(distArchive));

Review comment:
       Fixed

##########
File path: zeppelin-server/src/main/java/org/apache/zeppelin/socket/NotebookSocket.java
##########
@@ -67,7 +67,7 @@ public String getProtocol() {
   }
 
   public synchronized void send(String serializeMessage) throws IOException {
-    connection.getRemote().sendStringByFuture(serializeMessage);
+    connection.getRemote().sendString(serializeMessage);

Review comment:
       Fixed

##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -259,13 +266,48 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
       FileUtils.forceDelete(flinkZip);
 
       String hiveConfDir = launchContext.getProperties().getProperty("HIVE_CONF_DIR");
-      if (!org.apache.commons.lang3.StringUtils.isBlank(hiveConfDir)) {
+      if (!StringUtils.isBlank(hiveConfDir)) {
         File hiveConfZipFile = createHiveConfZip(new File(hiveConfDir));
         srcPath = localFs.makeQualified(new Path(hiveConfZipFile.toURI()));
         destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
         addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "hive_conf");
       }
     }
+
+    String yarnDistArchives = launchContext.getProperties().getProperty("zeppelin.yarn.dist.archives");
+    if (StringUtils.isNotBlank(yarnDistArchives)) {
+      for (String distArchive : yarnDistArchives.split(",")) {
+        URI distArchiveURI = null;
+        try {
+          distArchiveURI = new URI(distArchive);
+        } catch (URISyntaxException e) {
+          throw new IOException("Invalid uri: " + distArchive, e);
+        }
+        if (distArchiveURI.getScheme() == null || "file".equals(distArchiveURI.getScheme())) {

Review comment:
       fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-830638631


   For spark interpreter, we can leverage `spark.archives` to download and setup conda environment in both driver(spark interpreter) and executor. But for python interpreter, I don't think there's unified approach for that for now. Buy we can introduce unified configuration for that. e.g. We can introduce `python.archive` which will be translated to yarn/k8s specific configuration. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r655222659



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -259,13 +266,48 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
       FileUtils.forceDelete(flinkZip);
 
       String hiveConfDir = launchContext.getProperties().getProperty("HIVE_CONF_DIR");
-      if (!org.apache.commons.lang3.StringUtils.isBlank(hiveConfDir)) {
+      if (!StringUtils.isBlank(hiveConfDir)) {
         File hiveConfZipFile = createHiveConfZip(new File(hiveConfDir));
         srcPath = localFs.makeQualified(new Path(hiveConfZipFile.toURI()));
         destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
         addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "hive_conf");
       }
     }
+
+    String yarnDistArchives = launchContext.getProperties().getProperty("zeppelin.yarn.dist.archives");
+    if (StringUtils.isNotBlank(yarnDistArchives)) {
+      for (String distArchive : yarnDistArchives.split(",")) {
+        URI distArchiveURI = null;
+        try {
+          distArchiveURI = new URI(distArchive);
+        } catch (URISyntaxException e) {
+          throw new IOException("Invalid uri: " + distArchive, e);
+        }
+        if (distArchiveURI.getScheme() == null || "file".equals(distArchiveURI.getScheme())) {

Review comment:
       fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-864889515


   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer edited a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer edited a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-843240046


   Hi @zjffdu,
   
   > We can leverage yarn's resource cache mechanism. That means the same conda env downloaded by yarn_app_1 can be reused by yarn_app_2. If we download it in JupyterKernelInterpreter.java, it may cause network congestion if many python interpreters runs at the same time.
   
   I was aware of this, but it seems that downloading dependencies several times is the way of `spark.archives`. It is clear that this is not optimal.
   
   > Don't need to update conda archives to hdfs, just use the local file system to store the conda env. This would make the development much smooth, user use the local conda env to verify it in local environment and then move it to yarn environment in production environment.
   
   By local, do you mean the local file system of the Zeppelin server? In my environment, the Zeppelin user does not have access to the local file system of the Zeppelin server. Therefore, I would prefer a remote endpoint that is under the control of the Zeppelin user.
   I understand your development approach and it sounds great, but I think this is not suitable for a production environment. 
   
   Maybe we can support the download in `JupyterKernelInterpreter.java` with an additional property. Then it should not matter whether the files were provided by YARN or the download.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-843283503


   
   > I was aware of this, but it seems that downloading dependencies several times is the way of `spark.archives`. It is clear that this is not optimal.
   
   If I read spark code correctly, spark only download `spark.archives` in driver side and distribute to executors via its internal mechanism between driver and executor. 
   
   > By local, do you mean the local file system of the Zeppelin server? In my environment, the Zeppelin user does not have access to the local file system of the Zeppelin server. Therefore, I would prefer a remote endpoint that is under the control of the Zeppelin user.
   > I understand your development approach and it sounds great, but I think this is not suitable for a production environment.
   
   Let me clarify it, actually it is not only local file system, it could be any hadoop compatible file system, such as hdfs, s3. 
   
   > Maybe we can support the download in `JupyterKernelInterpreter.java` with an additional property. Then it should not matter whether the files were provided by YARN or the download.
   
   Actually for spark yarn mode, it would still use yarn cache mechanism to distribute archives [1]. Of course, for k8s mode, we can use other property to download the archive in `JupyterKernelInterpreter.java` for pure python interpreter, but for pyspark, it is not necessary, because it is already done by SparkSubmit [2]
   
   * [1] https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1663
   * [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L391
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-826297779


   @Reamer `spark.archives` works, but it is only available after spark 3.1, and I think it is better to put conda env in cloud storage and then specify it via `spark.archives` (because we spark-submit runs in pod so we can not specify local file as `spark.archives`. Do you think whether it work for users ?
   
   And I think this PR is orthogonal to `spark.archives`, because `spark.archives` will control both the python environment of driver and executor. The executor is out of control of zeppelin.  That's why this approach here only works for python interpreter which zeppelin can control its python environment. But of course we should make the configuration between spark and python interpreter as similar as possible. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-852767277


   CI is failed, I am looking at that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-864567409


   @Reamer Do you have any more comment ? Otherwise I will merge it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu edited a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu edited a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824601359


   @Reamer Actually the approach here also apply for pyspark interpreter but with different setting, I have verified that 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-836747887


   Thanks @Reamer for the investigation, so how about using `python.archives` for specifying conda env archives in both yarn and k8s, in this PR, I will make it work in yarn and we can create another ticket to support k8s. What do you think ? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824057131


   I think this is also a great feature for the K8s launcher. Unfortunately, I don't see any way to manipulate files before launching, like you did with YARN. Any ideas how we can make it more general so that it works with other launchers?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r616831897



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -266,6 +270,30 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
         addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "hive_conf");
       }
     }
+
+    String yarnDistArchives = launchContext.getProperties().getProperty("zeppelin.yarn.dist.archives");
+    if (org.apache.commons.lang3.StringUtils.isNotBlank(yarnDistArchives)) {

Review comment:
       Good point, fixed it now




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu edited a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu edited a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824603467


   > Updates of the image with a changed Python version and libraries are not really possible because you don't know which other notebooks might be broken afterwards.
   
   I mean allowing user to choose different image for their notes. Customizing image won't be a frequent operation IIUC
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824601359


   @Reamer Actually the approach here also apply for pyspark interpreter, I have verified that 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-825324712


   > > I mean allowing user to choose different image for their notes. Customizing image won't be a frequent operation IIUC
   > 
   > Maintaining different Zeppelin interpreter images is quite difficult. I would like to see the user (download and) activate a specific conda environment in his note.
   > The main advantage is that the image of the Zeppelin interpreter would be much smaller.
   
   But these python interpreter images can share them same base image and add another layer with different python environment. Although the image size may look a little big, the physical size or downloading size would be the same of using the current python interpreter image and then download the python environment via whatever approach. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r636668467



##########
File path: zeppelin-jupyter-interpreter/src/main/java/org/apache/zeppelin/jupyter/JupyterKernelInterpreter.java
##########
@@ -183,8 +195,43 @@ public String checkKernelPrerequisite(String pythonExec) {
     return "";
   }
 
+  private void activateCondaEnv(String envName) throws IOException {
+    LOGGER.info("Activating conda env: {}", envName);
+    ByteArrayOutputStream stdout = new ByteArrayOutputStream();
+    PumpStreamHandler psh = new PumpStreamHandler(stdout);
+    try {
+      if (!new File(envName).exists()) {
+        throw new IOException("Fail to activating conda env because no environment folder: " +
+                envName);
+      }
+      File scriptFile = Files.createTempFile("zeppelin_jupyter_kernel_", ".sh").toFile();
+      try (FileWriter writer = new FileWriter(scriptFile)) {
+        IOUtils.write(String.format("chmod 777 -R %s \nsource %s/bin/activate \nconda-unpack",
+                envName, envName),
+                writer);
+      }
+      scriptFile.setExecutable(true, false);
+      scriptFile.setReadable(true, false);
+      CommandLine cmd = new CommandLine(scriptFile.getAbsolutePath());
+      DefaultExecutor executor = new DefaultExecutor();
+      executor.setStreamHandler(psh);
+      int exitCode = executor.execute(cmd);
+      if (exitCode != 0) {
+        throw new IOException("Fail to activate conda env, " + stdout.toString());
+      } else {
+        LOGGER.info("Activate conda env successfully");
+        this.condaEnv = envName;
+        this.pythonExecutable = envName + "/bin/python";
+      }
+    } catch (Exception e) {

Review comment:
       This should be changed to specific exceptions. More at [RSPEC-2221](https://rules.sonarsource.com/java/tag/error-handling/RSPEC-2221)

##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -316,6 +347,25 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
     return amContainer;
   }
 
+  private URI resolveURI(String path) {

Review comment:
       It would be good if we could add tests with high coverage for this function, as the process is a bit special




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-825479159


   > Although the image size may look a little big, the physical size or downloading size would be the same of using the current python interpreter image and then download the python environment via whatever approach.
   
   You are right, the final disk size would be the same. It is even possible that it is lower, because if the same Python interpreter is started twice, they both use the same container layer.
   But the image approach has big disadvantages in my eyes.
   1) If we but the python environment (conda environment) into the image, then the user must select the image. I don't currently know of any way to check whether the image is present, which makes error handling in K8s very difficult. I think the error handling inside the zeppelin interpreter is much easier.
   2) For pyspark you need to create two images. A Zeppelin interpreter image (with the Python environment, Spark and the Zeppelin interpreter) and a Spark executor image (with the Python environment and Spark).
   
   Do you know if it is possible to set a URL for `spark.archives`? If not, then there is no other way to put the specific Python environment into the image.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r655220571



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -259,13 +266,48 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
       FileUtils.forceDelete(flinkZip);
 
       String hiveConfDir = launchContext.getProperties().getProperty("HIVE_CONF_DIR");
-      if (!org.apache.commons.lang3.StringUtils.isBlank(hiveConfDir)) {
+      if (!StringUtils.isBlank(hiveConfDir)) {
         File hiveConfZipFile = createHiveConfZip(new File(hiveConfDir));
         srcPath = localFs.makeQualified(new Path(hiveConfZipFile.toURI()));
         destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
         addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "hive_conf");
       }
     }
+
+    String yarnDistArchives = launchContext.getProperties().getProperty("zeppelin.yarn.dist.archives");
+    if (StringUtils.isNotBlank(yarnDistArchives)) {
+      for (String distArchive : yarnDistArchives.split(",")) {
+        URI distArchiveURI = null;
+        try {
+          distArchiveURI = new URI(distArchive);
+        } catch (URISyntaxException e) {
+          throw new IOException("Invalid uri: " + distArchive, e);
+        }
+        if (distArchiveURI.getScheme() == null || "file".equals(distArchiveURI.getScheme())) {
+          // zeppelin.yarn.dist.archives is local file
+          srcPath = localFs.makeQualified(new Path(distArchiveURI));
+          destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
+        } else {
+          // zeppelin.yarn.dist.archives is files on any hadoop compatible file system
+          destPath = new Path(removeLink(distArchive));

Review comment:
       Fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r649650890



##########
File path: zeppelin-plugins/launcher/yarn/src/test/java/org/apache/zeppelin/interpreter/launcher/YarnLauncherUtilTest.java
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zeppelin.interpreter.launcher;
+
+import org.junit.Test;
+
+import java.net.URI;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNull;
+
+public class YarnLauncherUtilTest {
+
+  @Test
+  public void testURIUtil() {
+    URI uri = YarnLauncherUtil.resolveURI("/tmp/env_1");
+    assertEquals("file", uri.getScheme());
+    assertEquals("/tmp/env_1", uri.getPath());
+    assertNull(uri.getFragment());
+
+    uri = YarnLauncherUtil.resolveURI("/tmp/env_1#env");

Review comment:
       Fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu edited a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu edited a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-842054131


   @Reamer Actually `YarnRemoteInterpreterProcess` doesn't do the downloading, it just upload the conda to hdfs as yarn app resource, and yarn will download it from hdfs before starting yarn container. 
   There're 2 benefits of using yarn app resource:
   * Don't need to update conda archives to hdfs, just use the local file system to store the conda env. This would make the development much smooth, user use the local conda env to verify it in local environment and then move it to yarn environment in production environment.
   * We can leverage yarn's resource cache mechanism. That means the same conda env downloaded by yarn_app_1 can be reused by yarn_app_2. If we download it in `JupyterKernelInterpreter.java`, it may cause network congestion if many python interpreters runs at the same time. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer removed a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer removed a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-864889515


   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-827815927


   You are right, putting the Conda environment in a cloud storage will be the best option. Do you have any idea what possibilities for integration `spark.archives` supports? Local mounting via filesystem is not an option in Kubernetes. I am hoping for an HTTP endpoint, which is very flexible and should work for most users.
   YARN should also be fine with an HTTP endpoint, so that the Conda environment can be dynamically loaded by the Zeppelin interpreter when the Python or PySpark interpreter is started.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] asfgit closed pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824535028


   @Reamer Regarding your experience, would building conda env into docker image works for users ? 
   User can use a base docker image with lots of common used python packages for their python interpreter. If they want to use other packages, they can use `!pip install` in zeppelin to install packages


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-831361414


   You are right, for the pyspark zeppelin interpreter we should use `spark.archives` to enable the python (conda) environment.
   For the python zeppelin interpreter we should use a configuration parameter that does almost the same as the spark equilant.
   
   > But for python interpreter, I don't think there's unified approach for that for now. Buy we can introduce unified configuration for that. e.g. We can introduce `python.archive` which will be translated to yarn/k8s specific configuration.
   
   The current approach seems to load the conda environment into the HDFS, which seems to be quite effective as the Zeppelin server and the Zeppelin interpreter process within YARN share the same content.
   A common way for docker, K8s and YARN could be a dynamic download starting from the Zeppelin interpreter just before it starts the Python process.
   At the moment I don't know if `spark.archives` supports a download via HTTP. I will find out as soon as possible. If this is possible, `python.archive` should also do the download so that you don't have to pack python (conda) environments several times.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-826297779


   @Reamer `spark.archives` works, but it is only available after spark 3.1, and I think it is better to put conda env in cloud storage and then specify it via `spark.archives` (because we spark-submit runs in pod so we can not specify local file as `spark.archives`. Do you think whether it work for users ?
   
   And I think this PR is orthogonal to `spark.archives`, because `spark.archives` will control both the python environment of driver and executor. The executor is out of control of zeppelin.  That's why this approach here only works for python interpreter which zeppelin can control its python environment. But of course we should make the configuration between spark and python interpreter as similar as possible. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r616830110



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -242,13 +244,15 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
 
     // Set the resources to localize
     this.stagingDir = new Path(fs.getHomeDirectory() + "/.zeppelinStaging", appId.toString());
+    LOGGER.info("Use staging directory: {}", this.stagingDir);
     Map<String, LocalResource> localResources = new HashMap<>();
 
     File interpreterZip = createInterpreterZip();
     Path srcPath = localFs.makeQualified(new Path(interpreterZip.toURI()));
     Path destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
     addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "zeppelin");
-    FileUtils.forceDelete(interpreterZip);
+    LOGGER.info("Add zeppelin archive: {}", destPath);
+    //FileUtils.forceDelete(interpreterZip);

Review comment:
       Right, I have uncommented it now. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r655079099



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -259,13 +266,48 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
       FileUtils.forceDelete(flinkZip);
 
       String hiveConfDir = launchContext.getProperties().getProperty("HIVE_CONF_DIR");
-      if (!org.apache.commons.lang3.StringUtils.isBlank(hiveConfDir)) {
+      if (!StringUtils.isBlank(hiveConfDir)) {
         File hiveConfZipFile = createHiveConfZip(new File(hiveConfDir));
         srcPath = localFs.makeQualified(new Path(hiveConfZipFile.toURI()));
         destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
         addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "hive_conf");
       }
     }
+
+    String yarnDistArchives = launchContext.getProperties().getProperty("zeppelin.yarn.dist.archives");
+    if (StringUtils.isNotBlank(yarnDistArchives)) {
+      for (String distArchive : yarnDistArchives.split(",")) {
+        URI distArchiveURI = null;
+        try {
+          distArchiveURI = new URI(distArchive);
+        } catch (URISyntaxException e) {
+          throw new IOException("Invalid uri: " + distArchive, e);
+        }
+        if (distArchiveURI.getScheme() == null || "file".equals(distArchiveURI.getScheme())) {
+          // zeppelin.yarn.dist.archives is local file
+          srcPath = localFs.makeQualified(new Path(distArchiveURI));
+          destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
+        } else {
+          // zeppelin.yarn.dist.archives is files on any hadoop compatible file system
+          destPath = new Path(removeLink(distArchive));

Review comment:
       `removeLink` sounds wrong at this point. I think `removeFragment` sounds better.
   https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax

##########
File path: zeppelin-server/src/main/java/org/apache/zeppelin/socket/NotebookSocket.java
##########
@@ -67,7 +67,7 @@ public String getProtocol() {
   }
 
   public synchronized void send(String serializeMessage) throws IOException {
-    connection.getRemote().sendStringByFuture(serializeMessage);
+    connection.getRemote().sendString(serializeMessage);

Review comment:
       This line should not be part of this PR.

##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -259,13 +266,48 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
       FileUtils.forceDelete(flinkZip);
 
       String hiveConfDir = launchContext.getProperties().getProperty("HIVE_CONF_DIR");
-      if (!org.apache.commons.lang3.StringUtils.isBlank(hiveConfDir)) {
+      if (!StringUtils.isBlank(hiveConfDir)) {
         File hiveConfZipFile = createHiveConfZip(new File(hiveConfDir));
         srcPath = localFs.makeQualified(new Path(hiveConfZipFile.toURI()));
         destPath = copyFileToRemote(stagingDir, srcPath, (short) 1);
         addResource(fs, destPath, localResources, LocalResourceType.ARCHIVE, "hive_conf");
       }
     }
+
+    String yarnDistArchives = launchContext.getProperties().getProperty("zeppelin.yarn.dist.archives");
+    if (StringUtils.isNotBlank(yarnDistArchives)) {
+      for (String distArchive : yarnDistArchives.split(",")) {
+        URI distArchiveURI = null;
+        try {
+          distArchiveURI = new URI(distArchive);
+        } catch (URISyntaxException e) {
+          throw new IOException("Invalid uri: " + distArchive, e);
+        }
+        if (distArchiveURI.getScheme() == null || "file".equals(distArchiveURI.getScheme())) {

Review comment:
       A null check is not required at this point.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] asfgit closed pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824591596


   > Regarding your experience, would building conda env into docker image works for users ?
   
   It works, but it's not as flexible as your approach with YARN. At the moment it's just an image with a Conda environment with a fixed set of Python libraries and a fixed Python version.
   Updates of the image with a changed Python version and libraries are not really possible because you don't know which other notebooks might be broken afterwards.
   
   The same problem exists in pyspark, so I have been thinking for quite a long time how we can activate a similar functionality there.
   I found this blog post which is very interesting.
   https://databricks.com/de/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-845683022


   Thank you for the clarification. I have created a [Jira ticket](https://issues.apache.org/jira/browse/ZEPPELIN-5384).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r643680742



##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnRemoteInterpreterProcess.java
##########
@@ -316,6 +347,25 @@ private ContainerLaunchContext setUpAMLaunchContext() throws IOException {
     return amContainer;
   }
 
+  private URI resolveURI(String path) {

Review comment:
       Addressed

##########
File path: zeppelin-jupyter-interpreter/src/main/java/org/apache/zeppelin/jupyter/JupyterKernelInterpreter.java
##########
@@ -183,8 +195,43 @@ public String checkKernelPrerequisite(String pythonExec) {
     return "";
   }
 
+  private void activateCondaEnv(String envName) throws IOException {
+    LOGGER.info("Activating conda env: {}", envName);
+    ByteArrayOutputStream stdout = new ByteArrayOutputStream();
+    PumpStreamHandler psh = new PumpStreamHandler(stdout);
+    try {
+      if (!new File(envName).exists()) {
+        throw new IOException("Fail to activating conda env because no environment folder: " +
+                envName);
+      }
+      File scriptFile = Files.createTempFile("zeppelin_jupyter_kernel_", ".sh").toFile();
+      try (FileWriter writer = new FileWriter(scriptFile)) {
+        IOUtils.write(String.format("chmod 777 -R %s \nsource %s/bin/activate \nconda-unpack",
+                envName, envName),
+                writer);
+      }
+      scriptFile.setExecutable(true, false);
+      scriptFile.setReadable(true, false);
+      CommandLine cmd = new CommandLine(scriptFile.getAbsolutePath());
+      DefaultExecutor executor = new DefaultExecutor();
+      executor.setStreamHandler(psh);
+      int exitCode = executor.execute(cmd);
+      if (exitCode != 0) {
+        throw new IOException("Fail to activate conda env, " + stdout.toString());
+      } else {
+        LOGGER.info("Activate conda env successfully");
+        this.condaEnv = envName;
+        this.pythonExecutable = envName + "/bin/python";
+      }
+    } catch (Exception e) {

Review comment:
       Addressed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-842054131


   @Reamer Actually `YarnRemoteInterpreterProcess` doesn't do the downloading, it just upload the conda to hdfs as yarn app resource, and yarn will download it from hdfs before starting yarn container. 
   There're 2 benefits of using yarn app resource:
   * Don't need to update conda archives to hdfs, just use the local file system to store the conda env. This would make the development much smooth, user use the local conda env to verify it and then move it to yarn environment in production environment.
   * We can leverage yarn's resource cache mechanism. That means the same conda env downloaded by yarn_app_1 can be reused by yarn_app_2. If we download it in `JupyterKernelInterpreter.java`, it may cause network congestion if many python interpreters runs at the same time. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] zjffdu edited a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
zjffdu edited a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-824603467


   > Updates of the image with a changed Python version and libraries are not really possible because you don't know which other notebooks might be broken afterwards.
   
   I mean allowing user to choose different image for their notes. Choose different image won't be a frequent operation IIUC
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer removed a comment on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer removed a comment on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-864889515


   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#issuecomment-864889515


   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [zeppelin] Reamer commented on a change in pull request #4097: [ZEPPELIN-5330]. Support conda env for python interpreter in yarn mode

Posted by GitBox <gi...@apache.org>.
Reamer commented on a change in pull request #4097:
URL: https://github.com/apache/zeppelin/pull/4097#discussion_r646271563



##########
File path: zeppelin-jupyter-interpreter/src/main/java/org/apache/zeppelin/jupyter/JupyterKernelInterpreter.java
##########
@@ -216,6 +259,23 @@ private void launchJupyterKernel(int kernelPort)
     }
   }
 
+  private File buildBootstrapScriptFile(int kernelPort) throws IOException {
+    StringBuilder builder = new StringBuilder();
+    if (condaEnv != null) {
+      builder.append("source " + condaEnv + "/bin/activate\n");
+    }
+    builder.append(pythonExecutable);
+    builder.append(" " + kernelWorkDir.getAbsolutePath() + "/kernel_server.py");

Review comment:
       If you use the StringBuilder, you should use the String Builder for the entire string.
   ```
   builder.append(" ").append(kernelWorkDir.getAbsolutePath()).append("/kernel_server.py");
   ...
   ```

##########
File path: zeppelin-plugins/launcher/yarn/src/main/java/org/apache/zeppelin/interpreter/launcher/YarnLauncherUtil.java
##########
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zeppelin.interpreter.launcher;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.File;
+import java.net.URI;
+import java.net.URISyntaxException;
+
+public class YarnLauncherUtil {
+
+  private static final Logger LOGGER = LoggerFactory.getLogger(YarnLauncherUtil.class);
+
+  public static URI resolveURI(String path) {

Review comment:
       Some javadoc would be nice.
   Please describe why `URI uri = new URI(path);` is not sufficient.

##########
File path: zeppelin-plugins/launcher/yarn/src/test/java/org/apache/zeppelin/interpreter/launcher/YarnLauncherUtilTest.java
##########
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zeppelin.interpreter.launcher;
+
+import org.junit.Test;
+
+import java.net.URI;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNull;
+
+public class YarnLauncherUtilTest {
+
+  @Test
+  public void testURIUtil() {
+    URI uri = YarnLauncherUtil.resolveURI("/tmp/env_1");
+    assertEquals("file", uri.getScheme());
+    assertEquals("/tmp/env_1", uri.getPath());
+    assertNull(uri.getFragment());
+
+    uri = YarnLauncherUtil.resolveURI("/tmp/env_1#env");

Review comment:
       Thank you for the tests. I think we should add some more tests.
   1) a string with a scheme
   2) a relative path




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org