Posted to commits@hudi.apache.org by "Alexey Kudinkin (Jira)" <ji...@apache.org> on 2022/07/25 21:39:00 UTC
[jira] [Commented] (HUDI-4081) Evaluate Spark SQL vs DS performance
[ https://issues.apache.org/jira/browse/HUDI-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571108#comment-17571108 ]
Alexey Kudinkin commented on HUDI-4081:
---------------------------------------
It turns out that most of the gap between the two (~20%) is attributable to an inadvertent conversion of the Dataset into an RDD[Row], which incurs the penalty of deserializing every row. You can see that in the plans below:
Before:
!Screen Shot 2022-07-25 at 10.04.37 AM.png|width=267,height=457!
After:
!Screen Shot 2022-07-25 at 10.05.00 AM.png|width=256,height=273!
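For readers without access to the screenshots, here is a minimal sketch of the kind of conversion described above. It is not the actual Hudi change: the object name, the input data, and the column names are hypothetical and chosen only for illustration. It assumes a local Spark session and contrasts `Dataset.rdd`, which decodes each internal row into an external `Row`, with `queryExecution.toRdd`, which exposes the internal rows without that decoding step.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, self-contained illustration; not taken from the Hudi code base.
object RowDeserializationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-deserialization-sketch")
      .getOrCreate()

    // Hypothetical input: any DataFrame would do.
    val df = spark.range(0, 1000000).selectExpr("id", "id % 10 AS bucket")

    // Dataset.rdd yields RDD[Row]: every InternalRow is decoded into an
    // external Row object, which adds a deserialization step to the plan.
    val viaRows = df.rdd.map(row => row.getLong(0)).reduce(_ + _)

    // queryExecution.toRdd yields RDD[InternalRow] and skips that decoding.
    // Internal rows may be reused by Spark, so the needed value is read
    // immediately instead of holding on to the row object.
    val viaInternalRows =
      df.queryExecution.toRdd.map(row => row.getLong(0)).reduce(_ + _)

    // Both paths compute the same sum; only the per-row cost differs.
    println(s"via RDD[Row]: $viaRows, via RDD[InternalRow]: $viaInternalRows")

    spark.stop()
  }
}
```

Running both variants and comparing the resulting jobs in the Spark UI is one way to see the extra per-row step; the size of the difference will depend on the schema and the workload.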
> Evaluate Spark SQL vs DS performance
> ------------------------------------
>
> Key: HUDI-4081
> URL: https://issues.apache.org/jira/browse/HUDI-4081
> Project: Apache Hudi
> Issue Type: Task
> Components: spark-sql
> Reporter: Ethan Guo
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-07-25 at 10.04.37 AM.png, Screen Shot 2022-07-25 at 10.05.00 AM.png
>
>
> In our internal benchmarks we've detected a regression in Spark SQL relative to Spark DataSource integration.
> We need to investigate and subsequently address that.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)