Posted to commits@hudi.apache.org by "Sagar Sumit (Jira)" <ji...@apache.org> on 2022/07/26 13:44:00 UTC

[jira] [Commented] (HUDI-4374) Support BULK_INSERT row-writing on streaming Dataset/DataFrame

    [ https://issues.apache.org/jira/browse/HUDI-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571438#comment-17571438 ] 

Sagar Sumit commented on HUDI-4374:
-----------------------------------

With [https://github.com/apache/hudi/pull/5470], bulk insert works on an RDD of InternalRow, which avoids much of the Row-to-RDD conversion overhead. We need to rethink whether this change is still needed; let's evaluate how much performance gain it would actually bring.

> Support BULK_INSERT row-writing on streaming Dataset/DataFrame 
> ---------------------------------------------------------------
>
>                 Key: HUDI-4374
>                 URL: https://issues.apache.org/jira/browse/HUDI-4374
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: spark, writer-core
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available, streaming
>             Fix For: 0.13.0
>
>
> In a structured streaming setup, when a Hudi table is written from a streaming source, HoodieStreamingSink calls HoodieSparkSqlWriter.write(). If the BULK_INSERT operation type is set, HoodieSparkSqlWriter.write() internally calls HoodieSparkSqlWriter.bulkInsertAsRow(), which does a simple df.write.format("hudi").options(...).save(). That 'write' call cannot be made on a streaming Dataset/DataFrame:
> {code:java}
> org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame
>     at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>     at org.apache.spark.sql.Dataset.write(Dataset.scala:3377)
>     at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:557)
>     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:178)
>     at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:91)
>     at scala.util.Try$.apply(Try.scala:213)
>     at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:90)
>     at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:166)
>     at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:89) {code}
> Bulk insert can still be done by bypassing the row-writing path, but HoodieStreamingSink needs to be fixed to support bulk insert via row-writing.
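The failure mode above can be sketched without Spark. The following is a minimal, hypothetical plain-Scala model (MiniDataset and this AnalysisException are stand-ins, not real Spark classes): Dataset.write guards on whether the plan is streaming and fails analysis, while a streaming sink receives each micro-batch as batch data, so re-wrapping the same rows as a non-streaming dataset is the direction a fix would take.

```scala
// Hypothetical stand-in for org.apache.spark.sql.AnalysisException.
final case class AnalysisException(msg: String) extends Exception(msg)

// Minimal model of a Dataset; `isStreaming` mirrors Dataset.isStreaming.
final case class MiniDataset(rows: Seq[String], isStreaming: Boolean) {

  // Mirrors the guard in Dataset.write that produced the stack trace above:
  // writing is refused outright when the dataset is streaming.
  def write: Seq[String] = {
    if (isStreaming)
      throw AnalysisException("'write' can not be called on streaming Dataset/DataFrame")
    rows
  }

  // Sketch of the fix direction: a sink sees each micro-batch as plain batch
  // data, so re-wrapping the same rows as a non-streaming dataset lets the
  // row-writing bulk-insert path run once per batch.
  def asBatch: MiniDataset = copy(isStreaming = false)
}

object Demo {
  def main(args: Array[String]): Unit = {
    val streaming = MiniDataset(Seq("r1", "r2"), isStreaming = true)

    val guardFired =
      try { streaming.write; false }
      catch { case _: AnalysisException => true }

    println(guardFired)                   // the guard fires on a streaming dataset
    println(streaming.asBatch.write.size) // the batch view writes fine
  }
}
```

In real Spark, the per-batch dataset handed to a Sink.addBatch is already batch data at the physical level; the analysis-time guard is what HoodieStreamingSink has to route around.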



--
This message was sent by Atlassian Jira
(v8.20.10#820010)