Posted to issues@hive.apache.org by "Stamatis Zampetakis (Jira)" <ji...@apache.org> on 2022/10/21 07:21:01 UTC

[jira] [Updated] (HIVE-8851) Broadcast files for small tables via SparkContext.addFile() and SparkFiles.get() [Spark Branch]

     [ https://issues.apache.org/jira/browse/HIVE-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stamatis Zampetakis updated HIVE-8851:
--------------------------------------

I cleared the fixVersion field since this ticket is still open. Please review this ticket and, if the fix is already committed to a specific version, set the version accordingly and mark the ticket as RESOLVED.

According to the [JIRA guidelines|https://cwiki.apache.org/confluence/display/Hive/HowToContribute] the fixVersion should be set only when the issue is resolved/closed.

> Broadcast files for small tables via SparkContext.addFile() and SparkFiles.get() [Spark Branch]
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8851
>                 URL: https://issues.apache.org/jira/browse/HIVE-8851
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Jimmy Xiang
>            Priority: Major
>             Fix For: spark-branch
>
>         Attachments: HIVE-8851.1-spark.patch, HIVE-8851.2-spark.patch
>
>
> Currently, files generated by SparkHashTableSinkOperator for small tables are written directly to HDFS with a high replication factor. When a map join happens, the map join operator loads these files into hash tables. Since multiple partitions can be processed on the same worker node, reading the same set of files multiple times is not ideal. This can be improved by calling SparkContext.addFile() on these files and using SparkFiles.get() to download them to the worker node just once.
> Please note that SparkFiles.get() is a static method. Code invoking this method needs to be in a static method. This calling method needs to be synchronized because it may be called from different threads.
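
The "static, synchronized, load once per worker" idea in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patches: the class name SmallTableFileCache and the resolver parameter are hypothetical, the resolver stands in for SparkFiles.get(fileName) so that the sketch runs without a Spark dependency, and the file is read as text lines although the real hash table files are presumably binary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-JVM cache of small-table files; not Hive's actual class. */
final class SmallTableFileCache {
    // One entry per broadcast file name, shared by every task thread in this JVM.
    private static final Map<String, List<String>> CACHE = new HashMap<>();
    // Counts real file reads, kept only so the behaviour can be observed.
    private static int loadCount = 0;

    private SmallTableFileCache() {}

    /**
     * Static and synchronized, as the description asks, so that concurrent
     * tasks on the same worker share a single read of the file.
     * ASSUMPTION: `resolver` stands in for SparkFiles.get(fileName), which maps
     * a name registered via SparkContext.addFile() to a local path on the worker.
     */
    static synchronized List<String> load(String fileName,
                                          Function<String, String> resolver) {
        List<String> rows = CACHE.get(fileName);
        if (rows == null) {
            Path local = Paths.get(resolver.apply(fileName));
            try {
                rows = Files.readAllLines(local);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            CACHE.put(fileName, rows);
            loadCount++;
        }
        return rows;
    }

    /** Number of times a file was actually read from local disk. */
    static synchronized int loadCount() {
        return loadCount;
    }
}
```

In a real Spark job the driver would first register the file with SparkContext.addFile(), and the tasks would pass something like SparkFiles::get as the resolver. Holding the lock for the whole read is the simplest choice; it serializes loads of different files too, which may or may not be acceptable depending on file sizes.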


