Posted to issues@spark.apache.org by "Lijia Liu (JIRA)" <ji...@apache.org> on 2017/10/10 05:36:00 UTC

[jira] [Created] (SPARK-22233) filter out empty InputSplit in HadoopRDD

Lijia Liu created SPARK-22233:
---------------------------------

             Summary: filter out empty InputSplit in HadoopRDD
                 Key: SPARK-22233
                 URL: https://issues.apache.org/jira/browse/SPARK-22233
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0
         Environment: Spark version: 2.2
master: yarn
deploy-mode: cluster

            Reporter: Lijia Liu


Sometimes Hive creates a table backed by many empty files. When reading such a table, Spark uses the InputFormat stored in the Hive Metastore and does not combine the empty files, so it generates one task for each of them.
Hive, by contrast, uses CombineHiveInputFormat (hive.input.format) by default, which merges small and empty files into fewer splits.
In this case, Spark therefore spends far more resources than Hive.

3 suggestions:
1. Add a configuration to filter out empty InputSplits in HadoopRDD.
2. Add a configuration so users can customize the InputFormat class in HadoopTableReader.
3. Use the InputFormat class configured in the Hive configuration (hive.input.format).
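The first suggestion amounts to a length filter over the splits before they become tasks. A minimal sketch in plain Scala, assuming a new opt-in flag guarding the filter; FakeSplit and the flag name are illustrative stand-ins for Hadoop's InputSplit and an actual Spark configuration, not the real API:

```scala
// Hypothetical sketch of suggestion 1: drop zero-length splits before they
// are turned into tasks. FakeSplit stands in for Hadoop's InputSplit; the
// real change would live in HadoopRDD.getPartitions behind a new Spark conf.
case class FakeSplit(path: String, length: Long)

object EmptySplitFilter {
  // When the (hypothetical) flag is on, keep only splits that contain data.
  def filterEmpty(splits: Seq[FakeSplit], ignoreEmpty: Boolean): Seq[FakeSplit] =
    if (ignoreEmpty) splits.filter(_.length > 0) else splits

  def main(args: Array[String]): Unit = {
    val splits = Seq(
      FakeSplit("part-00000", 0L),    // empty file produced by Hive
      FakeSplit("part-00001", 1024L), // file with real data
      FakeSplit("part-00002", 0L))    // another empty file
    // Only part-00001 survives, so only one task would be scheduled.
    println(filterEmpty(splits, ignoreEmpty = true).map(_.path).mkString(","))
  }
}
```

With the flag off, all three splits (and tasks) are kept, which matches the current behavior described above.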



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org