Posted to issues@spark.apache.org by "Ruslan Dautkhanov (JIRA)" <ji...@apache.org> on 2018/09/11 19:00:02 UTC

[jira] [Comment Edited] (SPARK-25164) Parquet reader builds entire list of columns once for each column

    [ https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611089#comment-16611089 ] 

Ruslan Dautkhanov edited comment on SPARK-25164 at 9/11/18 7:00 PM:
--------------------------------------------------------------------

Thanks [~bersprockets]

Very good find!

As described in SPARK-24316, "even simple queries of fetching 70k rows takes 20 minutes". 

PR-22188 gives a 21-44% improvement, reducing total runtime to 11-16 minutes.

Still, *reading 70k rows in over 10 minutes* with multiple executors seems quite slow. 

Do you think there might be another issue? It seems the time complexity of reading parquet files is O(num_columns * num_parquet_files).
 Is there any way to optimize this further?

Thanks.

 


> Parquet reader builds entire list of columns once for each column
> -----------------------------------------------------------------
>
>                 Key: SPARK-25164
>                 URL: https://issues.apache.org/jira/browse/SPARK-25164
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Bruce Robbins
>            Assignee: Bruce Robbins
>            Priority: Minor
>             Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from {{getPaths(0)}}.
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this                                                                                                                    
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>                       path,
>                       primitiveType,
>                       getMaxRepetitionLevel(path),
>                       getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
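> A hedged sketch of the quadratic pattern (simplified; the real loop in {{initializeInternal}} does more work per column, and names other than the ones quoted above are mine):
> {noformat}
> // Illustrative only: every iteration calls getColumns(), which rebuilds
> // the full List<ColumnDescriptor> from getPaths(0), so a single call to
> // initializeInternal does colCount * colCount work.
> for (int i = 0; i < requestedSchema.getFieldCount(); i++) {
>   ColumnDescriptor desc = requestedSchema.getColumns().get(i); // O(colCount) each time
>   // ... set up the reader for column i ...
> }
> {noformat}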
> To verify that this is an issue, I created a 1 million record parquet table with 6000 columns of type double and 67 files (so initializeInternal is called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter returns only a few thousand records. The query ran (on average) for 6.4 minutes.
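> As back-of-envelope arithmetic on those figures (a rough estimate, not a measurement): each {{getColumns()}} call rebuilds a 6000-entry list, and it is called once per column, for each of the 67 files:
> {noformat}
> per file:  6000 columns * 6000 ColumnDescriptor constructions = 36,000,000
> per query: 67 files * 36,000,000 ≈ 2.4 billion descriptor constructions
> {noformat}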
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than {{requestedSchema.getColumns()}}.
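> A hedged sketch of the resulting loop (simplified; everything except {{columnCache}} is my naming, not the exact patch):
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); // built once per file
> for (int i = 0; i < requestedSchema.getFieldCount(); i++) {
>   ColumnDescriptor desc = columnCache.get(i); // O(1) lookup instead of rebuilding the list
>   // ... set up the reader for column i ...
> }
> {noformat}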
> With the column cache variable, the same query runs in 5 minutes. So with my simple query, you save 22% of the time by not rebuilding the column list for each column.
> You get additional savings with a paths cache variable, bringing the total savings to 34% on the above query.


