Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/07/22 21:18:00 UTC

[jira] [Commented] (SPARK-30616) Introduce TTL config option for SQL Metadata Cache

    [ https://issues.apache.org/jira/browse/SPARK-30616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163080#comment-17163080 ] 

Apache Spark commented on SPARK-30616:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29194

> Introduce TTL config option for SQL Metadata Cache
> --------------------------------------------------
>
>                 Key: SPARK-30616
>                 URL: https://issues.apache.org/jira/browse/SPARK-30616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yaroslav Tkachenko
>            Assignee: Yaroslav Tkachenko
>            Priority: Major
>             Fix For: 3.1.0
>
>
> From [documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
> {quote}Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.
> {quote}
> Currently Spark [caches file listing for tables|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing] and requires issuing {{REFRESH TABLE}} any time the file listing has changed outside of Spark. Unfortunately, submitting {{REFRESH TABLE}} commands by hand can be very cumbersome. With frequently added files, hundreds of tables, and dozens of users querying the data (and expecting up-to-date results), manually refreshing metadata for each table is not a workable solution.
> This is a pretty common use-case for streaming ingestion of data, which can be done outside of Spark (with tools like Kafka Connect, etc.).
> A similar feature exists in Presto: {{hive.file-status-cache-expire-time}} can be found [here|https://prestosql.io/docs/current/connector/hive.html#hive-configuration-properties].
> I propose introducing a new option in Spark (something like "spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of this metadata cache. It can be disabled by default (-1), so that the existing behaviour does not change.
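
The TTL semantics proposed above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Spark's implementation: the class name FileListingCache, its parameters, and the list_files callback are all hypothetical, and the only detail taken from the description is that a TTL of -1 means entries never expire on their own.

```python
import time


class FileListingCache:
    """Toy per-table file-listing cache with an optional TTL (hypothetical names)."""

    def __init__(self, list_files, ttl_seconds=-1, clock=time.monotonic):
        # list_files: callable taking a table name and returning its current files.
        # ttl_seconds: -1 disables expiry, mirroring the default proposed above.
        # clock: injectable time source so expiry can be exercised without sleeping.
        self._list_files = list_files
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # table name -> (time loaded, list of files)

    def get(self, table):
        now = self._clock()
        entry = self._entries.get(table)
        if entry is not None:
            loaded_at, files = entry
            # A negative TTL means "never expire"; otherwise reuse the entry
            # only while it is younger than the TTL.
            if self._ttl < 0 or now - loaded_at < self._ttl:
                return files
        # Missing or expired entry: list the files again and remember when.
        files = list(self._list_files(table))
        self._entries[table] = (now, files)
        return files

    def refresh(self, table):
        # Manual invalidation, analogous in spirit to REFRESH TABLE.
        self._entries.pop(table, None)
```

With ttl_seconds=-1 a file added by an external writer stays invisible until refresh() is called, which is the situation described in the issue. With a positive TTL the listing is reloaded on the first get() after the entry has aged past the TTL, so no manual refresh is needed.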


