Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2014/08/18 20:02:28 UTC

[jira] [Resolved] (SPARK-3091) Add support for caching metadata on Parquet files

     [ https://issues.apache.org/jira/browse/SPARK-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-3091.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1.0

> Add support for caching metadata on Parquet files
> -------------------------------------------------
>
>                 Key: SPARK-3091
>                 URL: https://issues.apache.org/jira/browse/SPARK-3091
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>            Priority: Blocker
>             Fix For: 1.1.0
>
>
> For larger Parquet files, reading the file footers (done in parallel on up to 5 threads) and fetching the HDFS block locations (done serially) can take multiple seconds. ParquetInputFormat caches footers only within a single instance, not across instances, so the work is repeated for each new instance. We can add an option to cache this data within FilteringParquetInputFormat.
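
The difference between per-instance and cross-instance caching described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code that resolved this issue. Every name in it (FooterCache, read_footer, InputFormat, the example path) is hypothetical and is not a Spark or Parquet API. Keying entries on the file's modification time is an assumption about how stale entries could be avoided.

```python
import threading
from typing import Callable, Dict, List, Tuple


class FooterCache:
    """Hypothetical shared cache of per-file metadata.

    Entries are keyed by (path, modification time), so a rewritten file
    is treated as a new entry. Old entries are never evicted in this sketch.
    """

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._entries: Dict[Tuple[str, int], object] = {}

    def get(self, path: str, mtime: int) -> object:
        key = (path, mtime)
        with self._lock:
            if key in self._entries:
                return self._entries[key]
        # The slow read happens outside the lock so other lookups are not blocked.
        value = self._loader(path)
        with self._lock:
            # Another thread may have loaded the same key meanwhile; keep the first value.
            return self._entries.setdefault(key, value)


# Stand-in for the expensive footer read; records each call so reuse is visible.
calls: List[str] = []


def read_footer(path: str) -> dict:
    calls.append(path)
    return {"path": path, "row_groups": 3}


# One cache shared by every input format instance in the process.
SHARED_CACHE = FooterCache(read_footer)


class InputFormat:
    """Hypothetical input format that delegates footer lookups to a shared cache."""

    def __init__(self, cache: FooterCache) -> None:
        self._cache = cache

    def footer(self, path: str, mtime: int) -> object:
        return self._cache.get(path, mtime)


# Two separate instances reuse the same cached footer.
first = InputFormat(SHARED_CACHE).footer("/data/part-0.parquet", 100)
second = InputFormat(SHARED_CACHE).footer("/data/part-0.parquet", 100)
```

Because the cache lives outside any single InputFormat instance, the second lookup does not call read_footer again. A real implementation would also need to bound the cache size and decide how block locations are invalidated, which this sketch leaves out.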


