Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2019/09/17 00:52:00 UTC

[jira] [Commented] (HUDI-254) Provide mechanism for installing hudi-spark-bundle onto an existing spark installation

    [ https://issues.apache.org/jira/browse/HUDI-254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930985#comment-16930985 ] 

Vinoth Chandar commented on HUDI-254:
-------------------------------------

h3. Spark 2.3.3 on master

Once I copy the hudi-spark-bundle (had to shade com.databricks:spark-avro* for now) into the jars folder, I can do *a + b*:
{code:java}
root@adhoc-2:/var/hoodie/ws/docker# $SPARK_INSTALL/bin/spark-shell --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client  --driver-memory 1G --executor-memory 3G --num-executors 1
19/09/17 00:48:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1568681334864).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)
val jsonDF = spark.read.json("file:////var/hoodie/ws/docker/demo/data/batch_1.json")
import org.apache.hudi.DataSourceReadOptions;
import org.apache.hudi.DataSourceWriteOptions;
import org.apache.spark.sql.SaveMode;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.HoodieDataSourceHelpers;
import org.apache.hadoop.fs.FileSystem;

jsonDF.write.format("org.apache.hudi").
    option("hoodie.insert.shuffle.parallelism", "2").
    option("hoodie.upsert.shuffle.parallelism","2").
    option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL).
    option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL).
    option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key").
    option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date").
    option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts").
    option(HoodieWriteConfig.TABLE_NAME, "stock_ticks_derived_mor").
    mode(SaveMode.Append).
    save("file:///tmp/stock_ticks_derived_mor");

spark.read.format("org.apache.hudi").load("file:///tmp/stock_ticks_derived_mor/*/*/*/*.parquet").show

// Exiting paste mode, now interpreting.

+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| close|      date|day|   high|               key|   low|month|   open|symbol|                 ts|volume|year|
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
|     20190917004922|  20190917004922_0_1|NIHD_2018-08-31 10|            2018/08/31|0488121d-4ff5-4fb...|  5.67|2018/08/31| 31|   5.67|NIHD_2018-08-31 10|  5.67|   08|   5.67|  NIHD|2018-08-31 10:29:00|  2125|2018|
  ...
|     20190917004922| 20190917004922_0_19|STAA_2018-08-31 10|            2018/08/31|0488121d-4ff5-4fb...|  47.5|2018/08/31| 31|   47.5|STAA_2018-08-31 10|  47.5|   08|   47.5|  STAA|2018-08-31 10:28:00|   800|2018|
|     20190917004922| 20190917004922_0_20|EGAN_2018-08-31 09|            2018/08/31|0488121d-4ff5-4fb...|  14.5|2018/08/31| 31|14.5999|EGAN_2018-08-31 09|  14.5|   08|14.5999|  EGAN|2018-08-31 09:57:00|  2489|2018|
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
only showing top 20 rows
{code}
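The manual install step being exercised above (dropping the bundle into an existing Spark installation's jars directory) can be sketched as a small shell script. The Spark home, bundle path, and version below are stand-ins for illustration, not from the ticket; the sketch uses scratch directories so it is self-contained.

```shell
# Sketch of the manual install step: copy hudi-spark-bundle into Spark's jars/.
# SPARK_INSTALL and the bundle name/version here are hypothetical stand-ins;
# in practice, point them at a real Spark home and a locally built bundle
# (with com.databricks:spark-avro shaded in, per the note above).
SPARK_INSTALL=$(mktemp -d)          # stand-in for a real Spark installation
mkdir -p "$SPARK_INSTALL/jars"

# Stand-in for the bundle jar produced by the Hudi build:
BUNDLE_JAR=$(mktemp -d)/hudi-spark-bundle-0.5.0-SNAPSHOT.jar
touch "$BUNDLE_JAR"

# The "install": Spark picks up everything in jars/ on its classpath.
cp "$BUNDLE_JAR" "$SPARK_INSTALL/jars/"
ls "$SPARK_INSTALL/jars/"
```

After this, the spark-shell session above needs no extra `--jars` or `--packages` flags, since the bundle is already on the default classpath.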

> Provide mechanism for installing hudi-spark-bundle onto an existing spark installation
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-254
>                 URL: https://issues.apache.org/jira/browse/HUDI-254
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Spark datasource, SparkSQL Support
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> A lot of discussion around this kicked off from [https://github.com/apache/incubator-hudi/issues/869]
> Breaking this down into phases, when we drop the hudi-spark-bundle*.jar into the `jars` folder:
>  
> a) Writing data via the Hudi datasource should work
> b) a + Hive sync should work
> c) Spark datasource reads should work
> d) SparkSQL on the Hive-synced table should work
>  
> Start with Spark 2.3 (current demo setup) and then proceed to 2.4 and iron out issues.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)