Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:13:50 UTC

[jira] [Resolved] (SPARK-19351) Support for obtaining file splits from underlying InputFormat

     [ https://issues.apache.org/jira/browse/SPARK-19351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19351.
----------------------------------
    Resolution: Incomplete

> Support for obtaining file splits from underlying InputFormat
> -------------------------------------------------------------
>
>                 Key: SPARK-19351
>                 URL: https://issues.apache.org/jira/browse/SPARK-19351
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Vinoth Chandar
>            Priority: Major
>              Labels: bulk-closed
>
> This is a feature request to enable SparkSQL to obtain the files for a Hive partition by calling inputFormat.getSplits(), as opposed to listing files directly, while still using Spark's optimized Parquet readers for the actual IO. (Note that the difference between this and falling back entirely to Hive via spark.sql.hive.convertMetastoreParquet=false is that we still get SparkSQL benefits such as the new Parquet reader, schema merging, etc.)
> Some background and context from our use case at Uber. We have Hive tables where each partition contains versioned files (whenever records in a file change, we produce a new version, to speed up database ingestion), and such tables are registered with a custom InputFormat that filters out old versions and returns only the latest version of each file to the query.
> We have had this working for 5 months now across Hive/Spark/Presto, as follows:
> - Hive: Works out of the box by calling inputFormat.getSplits, so we are good there.
> - Presto: We made the fix in Presto, similar to what's proposed here.
> - Spark: We set convertMetastoreParquet=false. Perf is actually comparable for our use cases, but we run into schema merging issues now and then.
> We have explored a few approaches here and would like to get more feedback from you all before we go further.
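[Editor's note: the filtering described above — a custom InputFormat whose getSplits() hides superseded file versions from the query — can be sketched in a few lines. This is a minimal, self-contained illustration; the `fileid_vN.parquet` naming scheme and the `latest_version_splits` helper are assumptions for the example, not the actual format or code used at Uber.]

```python
import re

def latest_version_splits(listed_files):
    """Mimic what the custom getSplits() returns: only the newest
    version of each logical file in the partition, instead of the
    raw directory listing SparkSQL would otherwise use."""
    latest = {}
    for name in listed_files:
        # Assumed naming scheme for illustration: <file-id>_v<version>.parquet
        m = re.match(r"(?P<fid>.+)_v(?P<ver>\d+)\.parquet$", name)
        if not m:
            continue
        fid, ver = m.group("fid"), int(m.group("ver"))
        if fid not in latest or ver > latest[fid][0]:
            latest[fid] = (ver, name)
    return sorted(name for _, name in latest.values())

files = [
    "part-0001_v1.parquet",
    "part-0001_v2.parquet",  # supersedes v1 of the same file
    "part-0002_v1.parquet",
]
# A raw listing would hand all three files to the reader; the
# InputFormat-style filter returns only the latest version of each.
print(latest_version_splits(files))
# → ['part-0001_v2.parquet', 'part-0002_v1.parquet']
```

The proposal is essentially that SparkSQL invoke this kind of logic (via the table's registered InputFormat) to pick the file set, while the actual column IO still goes through Spark's vectorized Parquet readers.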



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org