Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2019/09/17 00:52:00 UTC
[jira] [Commented] (HUDI-254) Provide mechanism for installing hudi-spark-bundle onto an existing spark installation
[ https://issues.apache.org/jira/browse/HUDI-254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930985#comment-16930985 ]
Vinoth Chandar commented on HUDI-254:
-------------------------------------
h3. Spark 2.3.3 on master
Once I copy the hudi-spark-bundle jar (had to shade com.databricks:spark-avro* for now) into the `jars` folder, I can do *a + b*
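The copy step just mentioned can be sketched roughly as follows. This is a hedged sketch, not the project's official install procedure: the directories are mocked with `mktemp` so the snippet is self-contained, and the jar name is a placeholder for whatever the bundle build actually produces.

```shell
# Sketch: installing the shaded bundle means dropping it into the Spark
# installation's jars/ directory so spark-shell picks it up on both the
# driver and executor classpaths.
# SPARK_INSTALL and BUNDLE_SRC are mocked here; in a real setup
# SPARK_INSTALL is the Spark home and the jar comes from the bundle build.
SPARK_INSTALL=$(mktemp -d)
BUNDLE_SRC=$(mktemp -d)

mkdir -p "$SPARK_INSTALL/jars"
# Placeholder jar standing in for the real shaded bundle artifact.
touch "$BUNDLE_SRC/hudi-spark-bundle-0.5.0-SNAPSHOT.jar"

cp "$BUNDLE_SRC"/hudi-spark-bundle-*.jar "$SPARK_INSTALL/jars/"
ls "$SPARK_INSTALL/jars"
```

After this, launching `$SPARK_INSTALL/bin/spark-shell` as below finds the bundle without any `--jars` flag.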
{code:java}
root@adhoc-2:/var/hoodie/ws/docker# $SPARK_INSTALL/bin/spark-shell --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --executor-memory 3G --num-executors 1
19/09/17 00:48:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1568681334864).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.HoodieDataSourceHelpers
import org.apache.hadoop.fs.FileSystem

val jsonDF = spark.read.json("file:///var/hoodie/ws/docker/demo/data/batch_1.json")

jsonDF.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "2").
option("hoodie.upsert.shuffle.parallelism","2").
option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL).
option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL).
option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key").
option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date").
option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts").
option(HoodieWriteConfig.TABLE_NAME, "stock_ticks_derived_mor").
mode(SaveMode.Append).
  save("file:///tmp/stock_ticks_derived_mor")

spark.read.format("org.apache.hudi").load("file:///tmp/stock_ticks_derived_mor/*/*/*/*.parquet").show

// Exiting paste mode, now interpreting.

+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| close| date|day| high| key| low|month| open|symbol| ts|volume|year|
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
| 20190917004922| 20190917004922_0_1|NIHD_2018-08-31 10| 2018/08/31|0488121d-4ff5-4fb...| 5.67|2018/08/31| 31| 5.67|NIHD_2018-08-31 10| 5.67| 08| 5.67| NIHD|2018-08-31 10:29:00| 2125|2018|
...
| 20190917004922| 20190917004922_0_19|STAA_2018-08-31 10| 2018/08/31|0488121d-4ff5-4fb...| 47.5|2018/08/31| 31| 47.5|STAA_2018-08-31 10| 47.5| 08| 47.5| STAA|2018-08-31 10:28:00| 800|2018|
| 20190917004922| 20190917004922_0_20|EGAN_2018-08-31 09| 2018/08/31|0488121d-4ff5-4fb...| 14.5|2018/08/31| 31|14.5999|EGAN_2018-08-31 09| 14.5| 08|14.5999| EGAN|2018-08-31 09:57:00| 2489|2018|
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
only showing top 20 rows
{code}
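As a quick sanity check beyond the `show` output above, the write should leave a `.hoodie` timeline folder under the table base path whose completed instants confirm the commit landed. The sketch below mocks that layout (the base path and instant file name are illustrative, copied from the session above) so it is self-contained; against the real table one would list `/tmp/stock_ticks_derived_mor/.hoodie` directly.

```shell
# Sketch: Hudi keeps its timeline under <base path>/.hoodie; on a MOR
# table, completed delta commits show up as <instant>.deltacommit files.
# BASE is mocked here; in the demo it would be /tmp/stock_ticks_derived_mor.
BASE=$(mktemp -d)

# Mock the metadata layout the write above produces (names illustrative).
mkdir -p "$BASE/.hoodie"
touch "$BASE/.hoodie/20190917004922.deltacommit"

# List completed delta commits on the timeline.
ls "$BASE/.hoodie" | grep '\.deltacommit$'
```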
> Provide mechanism for installing hudi-spark-bundle onto an existing spark installation
> --------------------------------------------------------------------------------------
>
> Key: HUDI-254
> URL: https://issues.apache.org/jira/browse/HUDI-254
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Spark datasource, SparkSQL Support
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
>
> A lot of discussions around this kicked off from [https://github.com/apache/incubator-hudi/issues/869]
> Breaking it down into phases, when we drop the hudi-spark-bundle*.jar into the `jars` folder:
>
> a) Writing data via Hudi datasource should work
> b) a + Hive Sync should work
> c) Spark datasource reads should work
> d) SparkSQL on Hive synced table works
>
> Start with Spark 2.3 (current demo setup) and then proceed to 2.4 and iron out issues.
>
--
This message was sent by Atlassian Jira
(v8.3.2#803003)