Posted to commits@hudi.apache.org by "Alexey Kudinkin (Jira)" <ji...@apache.org> on 2022/07/25 21:39:00 UTC

[jira] [Commented] (HUDI-4081) Evaluate Spark SQL vs DS performance

    [ https://issues.apache.org/jira/browse/HUDI-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571108#comment-17571108 ] 

Alexey Kudinkin commented on HUDI-4081:
---------------------------------------

It turns out that most of the gap between Spark SQL and the DataSource path (~20%) is attributable to inadvertently dereferencing the Dataset into an RDD[Row], which incurs the cost of deserializing every row. You can see this in the plans below:

 

Before:

!Screen Shot 2022-07-25 at 10.04.37 AM.png|width=267,height=457!

 

After:

!Screen Shot 2022-07-25 at 10.05.00 AM.png|width=256,height=273!
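
For readers without access to the screenshots, here is a minimal sketch of the general pattern, not the actual Hudi code path. The object name, the input data, and the filter are all hypothetical; it only assumes a local Spark session. Calling `.rdd` on a DataFrame forces each internal row to be deserialized into an external Row, whereas staying within the Dataset API lets Spark keep operating on its internal row format.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical illustration only; names and data are not taken from Hudi.
object RddRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rdd-round-trip-sketch")
      .getOrCreate()

    // Stand-in for whatever DataFrame the write path receives.
    val df: DataFrame = spark.range(0, 1000).toDF("id")

    // Round trip through RDD[Row]: `.rdd` deserializes every row into an
    // external Row, and createDataFrame converts each one back afterwards.
    val viaRdd: DataFrame =
      spark.createDataFrame(df.rdd.filter(_.getLong(0) % 2 == 0), df.schema)

    // Same filter expressed on the Dataset API: no per-row deserialization.
    val viaDataset: DataFrame = df.filter(col("id") % 2 === 0)

    // Print both physical plans so they can be compared side by side.
    viaRdd.explain()
    viaDataset.explain()

    // Both variants select the same rows; only the execution path differs.
    assert(viaRdd.count() == viaDataset.count())

    spark.stop()
  }
}
```

Comparing the two `explain()` outputs should show the RDD variant as a scan over an existing RDD, which is opaque to the optimizer, while the Dataset variant keeps the filter inside the planned query.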

> Evaluate Spark SQL vs DS performance
> ------------------------------------
>
>                 Key: HUDI-4081
>                 URL: https://issues.apache.org/jira/browse/HUDI-4081
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: spark-sql
>            Reporter: Ethan Guo
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.12.0
>
>         Attachments: Screen Shot 2022-07-25 at 10.04.37 AM.png, Screen Shot 2022-07-25 at 10.05.00 AM.png
>
>
> In our internal benchmarks we've detected a regression in Spark SQL relative to Spark DataSource integration.
> We need to investigate and subsequently address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)