Posted to issues@spark.apache.org by "andrzej.stankevich@gmail.com (JIRA)" <ji...@apache.org> on 2018/07/30 22:36:00 UTC
[jira] [Created] (SPARK-24974) Spark put all file's paths into
SharedInMemoryCache even for unused partitions.
andrzej.stankevich@gmail.com created SPARK-24974:
----------------------------------------------------
Summary: Spark put all file's paths into SharedInMemoryCache even for unused partitions.
Key: SPARK-24974
URL: https://issues.apache.org/jira/browse/SPARK-24974
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.1
Reporter: andrzej.stankevich@gmail.com
SharedInMemoryCache holds the FileStatus of every file, whether or not the query restricts the partition columns. This causes long load times for queries that use only a couple of partitions, because Spark loads the paths of files from all partitions.
I partitioned the files by type, so I have a directory structure like
{code}
report_date=2018-07-24/type=A/file_1
{code}
I am trying to execute
{code}
val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter("type == 'A'").count
{code}
My query needs to load only the files of type A, which is just a couple of files. But Spark loads all 19K files into SharedInMemoryCache, which takes about 60 seconds, and only after that throws away the unused partitions.
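For comparison, here is a minimal sketch of a possible workaround, not a fix for the underlying behavior. It points the reader directly at the type=A directory, so only that directory is handed to Spark for listing, and it sets the documented "basePath" option so that report_date and type are still exposed as partition columns. The object name ReadSinglePartition and the helper partitionDir are hypothetical names introduced only for this illustration, and whether this avoids the full listing on 2.2.1 is an assumption I have not measured.

```scala
import org.apache.spark.sql.SparkSession

object ReadSinglePartition {
  // Hypothetical helper: builds the directory of one (report_date, type) partition.
  def partitionDir(root: String, reportDate: String, fileType: String): String =
    s"$root/report_date=$reportDate/type=$fileType"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-single-partition")
      .getOrCreate()

    val root = "/custom_path"

    // Pass only the needed partition directory to the reader.
    // "basePath" tells partition discovery where the table root is, so the
    // report_date and type columns are still derived from the directory names.
    val count = spark.read
      .option("basePath", root)
      .parquet(partitionDir(root, "2018-07-24", "A"))
      .count()

    println(count)
    spark.stop()
  }
}
```

The trade-off is that the partition filter moves out of the query and into the path, so it only helps when the needed partitions are known before the read.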