Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/02/04 20:07:00 UTC

[jira] [Updated] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

     [ https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-30616:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> Introduce TTL config option for SQL Parquet Metadata Cache
> ----------------------------------------------------------
>
>                 Key: SPARK-30616
>                 URL: https://issues.apache.org/jira/browse/SPARK-30616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yaroslav Tkachenko
>            Priority: Major
>
> From [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Unfortunately, simply submitting "REFRESH TABLE" commands can be very cumbersome. With frequently generated new Parquet files, hundreds of tables, and dozens of users querying the data (and expecting up-to-date results), manually refreshing metadata for each table is not a practical solution. This is a pretty common use case for streaming ingestion of data; see the sketch of the current workaround below.
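> A minimal sketch of today's manual workaround, assuming a table named db.events (hypothetical name); both the SQL statement and the catalog call are existing Spark APIs:
> {code:scala}
> // Every consumer (or a scheduled job) has to invalidate the cached
> // Parquet metadata explicitly before querying for fresh results:
> spark.sql("REFRESH TABLE db.events")
> // or, equivalently, via the catalog API:
> spark.catalog.refreshTable("db.events")
> {code}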
> I propose introducing a new option in Spark (something like "spark.sql.parquet.metadataCache.refreshInterval") that controls the TTL of this metadata cache. Its default value can be pretty high (an hour? a few hours?), so it doesn't alter the existing behavior much. When it's set to 0, the cache is effectively disabled (which could be useful for testing or some edge cases).
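> A sketch of how the proposed option might be used; the option name and value format are assumptions taken from this proposal, not an existing Spark setting:
> {code:scala}
> // Hypothetical: let cached Parquet metadata expire after one hour,
> // so long-running sessions eventually pick up newly written files.
> spark.conf.set("spark.sql.parquet.metadataCache.refreshInterval", "1h")
>
> // Hypothetical: a value of 0 would effectively disable the cache
> // (useful for testing or some edge cases).
> // spark.conf.set("spark.sql.parquet.metadataCache.refreshInterval", "0")
> {code}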



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org