Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/14 23:23:10 UTC

[GitHub] [spark] PerilousApricot edited a comment on pull request #31774: [SPARK-34659] Fix that Web UI always correctly get appId

PerilousApricot edited a comment on pull request #31774:
URL: https://github.com/apache/spark/pull/31774#issuecomment-994128564


   @gengliangwang -- I actually have a very simple reproducer that uses nginx as a reverse proxy instead of jupyterhub (to eliminate that failure mode). The following script sets up the proxy; note that it forwards `/user/PerilousApricot/proxy/4040/` to the root of the Spark web UI.
   
   **proxy-fail.sh**
   ```bash
   #!/bin/bash
   
   
   cat << \EOT > nginx.conf
   user  nginx;
   worker_processes  auto;
   error_log  /var/log/nginx/error.log notice;
   pid        /var/run/nginx.pid;
   events {
       worker_connections  1024;
   }
   http {
       include       /etc/nginx/mime.types;
       default_type  application/octet-stream;
       log_format  main  '[$time_local] "$request" '
                         '$status $body_bytes_sent "$http_referer" '
                         '"$http_x_forwarded_for"';
       access_log  /dev/stdout  main;
       server {
           listen       5050;
           server_name  localhost;
           location /user/PerilousApricot/proxy/4040/ {
               error_log  /dev/stderr debug;
               proxy_pass http://localhost:4040/;
               proxy_pass_header Content-Type;
           }        
       }
   }
   
   EOT
   
   docker run -it --rm=true --name spark-31174-proxy --network=host -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro nginx
   ```
   
   Run that proxy in one terminal, then run pyspark:
   ```
   SPARK_PUBLIC_DNS=localhost:5050/user/PerilousApricot/proxy/4040/jobs/ pyspark --conf spark.ui.reverseProxyUrl=http://localhost:5050/user/PerilousApricot/proxy/4040/ --conf spark.driver.extraJavaOptions="-Dlog4j.debug=true" --conf spark.ui.proxyBase=/user/PerilousApricot/proxy/4040/ --conf spark.app.name=proxyApp
   ```
   
   
   Open `http://localhost:5050/user/PerilousApricot/proxy/4040/executors/` in a browser with the developer tools open to watch the traffic go by. You will see a number of successful requests to static resources such as:
   
   ```
   http://localhost:5050/user/PerilousApricot/proxy/4040//static/webui.css
   http://localhost:5050/user/PerilousApricot/proxy/4040//static/webui.js
   ```
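
   If you'd rather check from the command line, a quick curl against one of those static resources (assuming the proxy from `proxy-fail.sh` above is still running) confirms they come back with a 200:
   ```bash
   # Static assets are proxied fine; only the REST API lookup misbehaves.
   curl -s -o /dev/null -w '%{http_code}\n' \
       http://localhost:5050/user/PerilousApricot/proxy/4040//static/webui.css
   ```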
   
   Notice, however, that there is a failed request (and the reason for this PR):
   ```
   http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors
   ```
   
   If you run curl manually, you can see that the request fails both through the reverse proxy and against the web UI itself:
   ```
   curl -v -o /dev/null http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors
   curl -v -o /dev/null http://localhost:4040/api/v1/applications/4040/allexecutors
   ```
   
   But if you copy-paste the appId from the Spark console (in my case: `Spark context available as 'sc' (master = local[*], app id = local-1639522961946).`), the following two requests succeed:
   ```
   curl http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/local-1639522961946
   curl -v -o /dev/null http://localhost:4040/api/v1/applications/local-1639522961946
   ```
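
   (As an aside: instead of copy-pasting, you can pull the real appId out of the REST API directly. This is just a convenience sketch; it assumes `jq` is installed.)
   ```bash
   # Ask the unproxied REST API for the running application's id, then retry
   # the allexecutors call through the reverse proxy with that id.
   APP_ID=$(curl -s http://localhost:4040/api/v1/applications | jq -r '.[0].id')
   curl -v -o /dev/null "http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/${APP_ID}/allexecutors"
   ```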
   
   To confirm the issue, let's restart the proxy and pyspark, but instead of proxying `/user/PerilousApricot/proxy/4040/`, let's proxy `/user/PerilousApricot/yxorp/4040/` (note that there is no path element named "proxy" in the URL). First execute
   **proxy-win.sh**
   ```bash
   #!/bin/bash
   
   
   cat << \EOT > nginx.conf
   user  nginx;
   worker_processes  auto;
   error_log  /var/log/nginx/error.log notice;
   pid        /var/run/nginx.pid;
   events {
       worker_connections  1024;
   }
   http {
       include       /etc/nginx/mime.types;
       default_type  application/octet-stream;
       log_format  main  '[$time_local] "$request" '
                         '$status $body_bytes_sent "$http_referer" '
                         '"$http_x_forwarded_for"';
       access_log  /dev/stdout  main;
       server {
           listen       5050;
           server_name  localhost;
           location /user/PerilousApricot/yxorp/4040/ {
               #error_log  /dev/stderr debug;
               proxy_pass http://localhost:4040/;
               #proxy_redirect     off;
               proxy_pass_header Content-Type;
               #rewrite /user/PerilousApricot/yxorp/4040(/.*|$) $1  break;
           }        
       }
   }
   
   EOT
   
   docker run -it --rm=true --name spark-31174-proxy --network=host -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro nginx
   ```
   
   and then run in a different terminal
   ```
   SPARK_PUBLIC_DNS=localhost:5050/user/PerilousApricot/yxorp/4040/jobs/ pyspark --conf spark.ui.reverseProxyUrl=http://localhost:5050/user/PerilousApricot/yxorp/4040/ --conf spark.driver.extraJavaOptions="-Dlog4j.debug=true" --conf spark.ui.proxyBase=/user/PerilousApricot/yxorp/4040/ --conf spark.app.name=proxyApp
   ```
   
   Open `http://localhost:5050/user/PerilousApricot/yxorp/4040/executors/` and you can see that the page renders properly. Looking at the developer console, you will see that instead of attempting to open
   
   ```
   http://localhost:5050/user/PerilousApricot/proxy/4040/api/v1/applications/4040/allexecutors
   ```
   
   this version requests the status of the executors from
   ```
   http://localhost:5050/user/PerilousApricot/yxorp/4040//api/v1/applications/local-1639523380430/allexecutors
   ```
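
   For a command-line confirmation, the same request through the `yxorp` proxy succeeds as well (substitute the appId your own Spark console prints):
   ```bash
   # The appId below is from my session; yours will differ.
   curl -v -o /dev/null http://localhost:5050/user/PerilousApricot/yxorp/4040/api/v1/applications/local-1639523380430/allexecutors
   ```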
   
   I hope this is enough to show that @ornew's analysis was right -- the fault isn't with jupyterhub; it is simply that the logic that tries to look up the appId chokes if there is a path element named "proxy" in the URL.
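
   To illustrate the failure mode (a sketch of the suspected behavior only, not Spark's actual code): if the appId lookup splits the request path and takes whatever follows a "proxy" segment, it will happily return "4040":
   ```bash
   # Hypothetical re-creation of the suspected appId extraction: split the
   # path on "/" and treat the element after "proxy" as the appId.
   path="/user/PerilousApricot/proxy/4040/api/v1/applications"
   IFS='/' read -ra parts <<< "$path"
   for i in "${!parts[@]}"; do
       if [[ "${parts[$i]}" == "proxy" ]]; then
           echo "guessed appId: ${parts[$((i+1))]}"   # prints "4040", not the real appId
       fi
   done
   ```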
   
   Can you please re-examine this?
   
   EDIT: I tested with Spark 3.2.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org