Posted to issues@spark.apache.org by "Tomas Bartalos (JIRA)" <ji...@apache.org> on 2019/08/03 23:12:00 UTC

[jira] [Commented] (SPARK-4502) Spark SQL reads unnecessary nested fields from Parquet

    [ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899534#comment-16899534 ] 

Tomas Bartalos commented on SPARK-4502:
---------------------------------------

I'm fighting with the same problem; a few findings that are not obvious from reading the issue:
 * To use this feature, set _spark.sql.optimizer.nestedSchemaPruning.enabled=true_ (see the snippet after this list).
 * According to my tests, the fix works only for one level of nesting. For example, _event.amount_ is pruned correctly, while _event.spent.amount_ reads the whole event structure :(
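Here is a minimal sketch of enabling the flag from code, assuming a SparkSession named {{spark}} (as in spark-shell); the same setting can also be passed via {{--conf}}:

{code:scala}
// Nested schema pruning is off by default in Spark 2.4;
// enable it for the current session:
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
{code}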

The only way to optimize reads of fields nested 2+ levels deep is to specify a projected schema at read time (as suggested by [~aeroevan]):

{{val df = spark.read.format("parquet").schema(projectedSchema).load(<path>)}}
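For illustration, a sketch of building such a projected schema; the _event.spent.amount_ field and the path are hypothetical, and a SparkSession named {{spark}} is assumed:

{code:scala}
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.types._

// Projected schema keeping only event.spent.amount out of the full event struct:
val projectedSchema = StructType(Seq(
  StructField("event", StructType(Seq(
    StructField("spent", StructType(Seq(
      StructField("amount", DoubleType)
    )))
  )))
))

val df = spark.read.format("parquet").schema(projectedSchema).load("/data/events")
df.select(sum(col("event.spent.amount"))).show()
{code}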

*This workaround can't be used for queries coming from the Thrift server and connected BI tools. Needless to say, having this feature work at any level of nesting would be wonderful.*

This is a real performance killer for big, deeply nested structures. My test of summing one nested field vs. one top-level field shows 2.5 minutes vs. 4 seconds.
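As a quick sanity check (a sketch, with hypothetical column names): the _ReadSchema_ reported by the Parquet scan in the physical plan reveals whether pruning actually kicked in:

{code:scala}
import org.apache.spark.sql.functions.col

// With pruning working, the FileScan's ReadSchema should list only the
// selected leaf, e.g. struct<event:struct<amount:double>>, not the full struct.
spark.read.parquet("/data/events")
  .select(col("event.amount"))
  .explain()
{code}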

 

> Spark SQL reads unnecessary nested fields from Parquet
> ------------------------------------------------------
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun
>            Assignee: Michael Allman
>            Priority: Critical
>             Fix For: 2.4.0
>
>
> When reading a field of a nested column from Parquet, Spark SQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column. This may degrade performance significantly when a nested column has many fields.
> For example, I loaded JSON tweet data into Spark SQL and ran the following query:
> {{SELECT User.contributors_enabled FROM Tweets;}}
> User is a nested structure that has 38 primitive fields (for the Tweets schema, see https://dev.twitter.com/overview/api/tweets). Here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query needs only 1 column. I also measured the bytes read within Parquet: in both cases, the same number of bytes (99365194) was read.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org