You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/05 06:19:43 UTC

[GitHub] [spark] sarutak opened a new pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

sarutak opened a new pull request #32052:
URL: https://github.com/apache/spark/pull/32052


   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   This PR fixes an issue that `ADD JAR` command can't add jar files whose name contain whitespaces though `ADD FILE` and `ADD ARCHIVE` work with such files.
   
   If we have `/some/path/test file.jar` and execute the following command:
   
   ```
   ADD JAR "/some/path/test file.jar";
   ```
   
   The following exception is thrown.
   
   ```
   21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test file.jar"]
   java.lang.IllegalArgumentException: Illegal character in path at index 9: /some/path/test file.jar
   	at java.net.URI.create(URI.java:852)
   	at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
   	at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
   ```
   
   This is because `HiveSessionStateBuilder` and `SessionStateBuilder` don't check whether the form of the path is URI or plain path and it always regards the path as URI form.
   Whitespces should be encoded to `%20` so `/some/path/test file.jar` is rejected.
   We can resolve this part by checking whether the given path is URI form or not.
   
   Unfortunatelly, if we fix this part, another problem occurs.
   When we execute `ADD JAR` command, Hive's `ADD JAR` command is executed in `HiveClientImpl.addJar` and `AddResourceProcessor.run` is transitively invoked.
   In `AddResourceProcessor.run`, the command line is just split by `
   s+` and the path is also split into `/some/path/test` and `file.jar` and passed to `ss.add_resources`.
   https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75
   So, the command still fails.
   
   Even if we convert the form of the path to URI like `file:/some/path/test%20file.jar` and execute the following command:
   
   ```
   ADD JAR "file:/some/path/test%20file";
   ```
   
   The following exception is thrown.
   
   ```
   21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does not exist
   java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not exist
   	at org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
   	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
   	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
   	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
   	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
   	at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)
   ```
   
   The reason is `Utilities.realFile` invoked in `SessionState.validateFiles` returns `null` as the result of `fs.exists(path)` is `false`.
   https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064
   
   `fs.exists` checks the existence of the given path by comparing the string representation of Hadoop's `Path`.
   The string representation of `Path` is similar to URI but it's actually different.
   `Path` doesn't encode the given path.
   For example, the URI form of `/some/path/jar file.jar` is `file:/some/path/jar%20file.jar` but the `Path` form of it is `file:/some/path/jar file.jar`. So `fs.exists` returns false.
   
   So the solution I come up with is removing Hive's `ADD JAR` from `HiveClientimpl.addJar`.
   I think Hive's `ADD JAR` was used to add jar files to the class loader for metadata and isolate the class loader from the one for execution.
   https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103
   
   But, as of SPARK-10810 and SPARK-10902 (#8909) are resolved, the class loaders for metadata and execution seem to be isolated with different way.
   https://github.com/apache/spark/pull/8909/files#diff-8ef7cabf145d3fe7081da799fa415189d9708892ed76d4d13dd20fa27021d149R635-R641
   
   In the current implementation, such class loaders seem to be isolated by `SharedState.jarClassLoader` and `IsolatedClientLoader.classLoader`.
   
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L173-L188
   https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L956-L967
   
   So I wonder we can remove Hive's `ADD JAR` from `HiveClientImpl.addJar`.
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   This is a bug.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-813345635


   **[Test build #136903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136903/testReport)** for PR 32052 at commit [`c88104e`](https://github.com/apache/spark/commit/c88104e6fd15e1f8284e4ea2506d86e5a7ca9d12).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #32052:
URL: https://github.com/apache/spark/pull/32052


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-813252175






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-813233318


   **[Test build #136903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136903/testReport)** for PR 32052 at commit [`c88104e`](https://github.com/apache/spark/commit/c88104e6fd15e1f8284e4ea2506d86e5a7ca9d12).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-815420143


   I agree with this change. Thanks @sarutak and @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-813359546


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136903/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-813259548


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41480/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-813359546


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136903/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-813233318


   **[Test build #136903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136903/testReport)** for PR 32052 at commit [`c88104e`](https://github.com/apache/spark/commit/c88104e6fd15e1f8284e4ea2506d86e5a7ca9d12).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #32052: [SPARK-34955][SQL] ADD JAR command cannot add jar files which contains whitespaces in the path

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32052:
URL: https://github.com/apache/spark/pull/32052#issuecomment-813259548


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41480/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org