Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2019/09/19 03:07:00 UTC

[jira] [Comment Edited] (HUDI-260) Hudi Spark Bundle does not work when passed in extraClassPath option

    [ https://issues.apache.org/jira/browse/HUDI-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933015#comment-16933015 ] 

Vinoth Chandar edited comment on HUDI-260 at 9/19/19 3:06 AM:
--------------------------------------------------------------

Tried to search around for this; it seems like any code that's used in a closure (i.e. code that Spark will serialize from the driver to the executors) needs to be passed via --jars and not extraClassPath. I figure they use different class loaders. The other theory is that if we use Spark's Java APIs, the jars need to be placed under the `jars` folder and that's the way to go. My guess is that the Java lambda -> Scala -> codegen path fails someplace when the bundle is specified via extraClassPath.
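
To make the closure point concrete, here's a rough sketch (not from any test I ran; the data is made up) of the kind of usage that requires the Hudi classes to be loadable on the executors:

{code}
// Sketch only: a closure that references a Hudi class. Spark serializes this
// function on the driver and deserializes it on the executors, so
// org.apache.hudi.DataSourceReadOptions must be resolvable by the executor
// class loader as well -- which --jars / spark.jars guarantees, while
// extraClassPath only prepends a path that must already exist on every
// executor host.
import org.apache.hudi.DataSourceReadOptions

val tagged = spark.sparkContext
  .parallelize(Seq("key1", "key2"))
  .map(k => (k, DataSourceReadOptions.getClass.getName)) // Hudi symbol used inside the closure
tagged.collect().foreach(println)
{code}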

I am still looking, but it does not seem like an issue with how we are bundling/shading (there are no Spark/Scala jars in the bundle).

Anyway, I am having trouble reproducing this error. For me, the bundle on extraClassPath is somehow not even getting picked up (spark.jars works):

{code}
root@adhoc-2:/opt# cat /opt/spark/conf/spark-defaults.conf
...
spark.driver.extraClassPath      /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar
spark.executor.extraClassPath    /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar

root@adhoc-2:/opt# $SPARK_INSTALL/bin/spark-shell --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --executor-memory 3G --num-executors 1
19/09/19 02:54:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1568861680731).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.hudi.DataSourceReadOptions;
<console>:23: error: object hudi is not a member of package org.apache
       import org.apache.hudi.DataSourceReadOptions;
                         ^

scala>
{code}
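
For comparison, this is the kind of invocation I mean by "spark.jars works": same bundle path, passed via --jars instead of extraClassPath (sketch of the command only; output omitted):

{code}
# Sketch only: pass the bundle explicitly instead of relying on extraClassPath.
# --jars (equivalently --conf spark.jars=...) ships the jar and registers it
# with both the driver and executor class loaders.
$SPARK_INSTALL/bin/spark-shell --master local[2] \
  --driver-class-path $HADOOP_CONF_DIR \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --jars /var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar
{code}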

Can you give me a reproducible setup on the demo containers?



> Hudi Spark Bundle does not work when passed in extraClassPath option
> --------------------------------------------------------------------
>
>                 Key: HUDI-260
>                 URL: https://issues.apache.org/jira/browse/HUDI-260
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Spark datasource, SparkSQL Support
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> On EMR's side we have the same findings. *a + b + c + d* work in the following cases:
>  * The bundle jar (with databricks-avro shaded) is specified using *--jars* or *spark.jars* option
>  * The bundle jar (with databricks-avro shaded) is placed in the Spark Home jars folder i.e. */usr/lib/spark/jars* folder
> However, it does not work if the jar is specified using the *spark.driver.extraClassPath* and *spark.executor.extraClassPath* options, which is what EMR uses to configure external dependencies. Although we can drop the jar into the */usr/lib/spark/jars* folder, I am not sure that is recommended, because that folder is supposed to contain the jars that ship with Spark. Extra dependencies from the user's side would be better off specified through the *extraClassPath* option.
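
For reference, a minimal sketch of the two working setups listed above (the bundle path is a placeholder, not an actual EMR location):

{code}
# Sketch only; /path/to/hoodie-spark-bundle.jar is a placeholder path.
# (a) Ship the bundle explicitly, e.g. in spark-defaults.conf:
#       spark.jars    /path/to/hoodie-spark-bundle.jar
# (b) Or place it alongside Spark's own jars:
cp /path/to/hoodie-spark-bundle.jar /usr/lib/spark/jars/
# Unlike these two, extraClassPath only prepends the path to each JVM's
# classpath and does not distribute the jar to the executors.
{code}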



--
This message was sent by Atlassian Jira
(v8.3.4#803005)