Posted to issues@spark.apache.org by "Rao Fu (JIRA)" <ji...@apache.org> on 2018/08/17 23:33:00 UTC

[jira] [Updated] (SPARK-25126) avoid creating OrcFile.Reader for all orc files

     [ https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rao Fu updated SPARK-25126:
---------------------------
       Priority: Minor  (was: Major)
    Description: 
We have a Spark job that starts by reading ORC files under an S3 directory, and we noticed that the job consumes a lot of memory when both the number of ORC files and the size of each file are large. The memory bloat went away with the following workaround.

1) Create a Dataset<Row> from a single ORC file.

Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);

2) When creating a Dataset<Row> from all files under the directory, reuse the schema from the previous Dataset.

Dataset<Row> rows = spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
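
Putting the two steps together (a minimal, self-contained sketch; oneFile and path are placeholders for the actual S3 locations):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("orc-schema-workaround").getOrCreate();

// 1) Infer the schema from a single ORC file only.
String oneFile = "s3://bucket/data/part-00000.orc";  // placeholder
Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);

// 2) Supply that schema explicitly when loading the whole directory,
//    so Spark does not open every file to infer it.
String path = "s3://bucket/data/";                   // placeholder
Dataset<Row> rows = spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);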

I believe the issue is that, in order to infer the schema, an OrcFile.Reader is created for every ORC file under the directory even though only the first one is used. Creating a Reader loads the ORC file's metadata into memory, so memory consumption is very high when there are many files under the directory.

The issue exists in both Spark 2.0 and HEAD.

In 2.0, OrcFileOperator.readSchema is used.

https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95

In HEAD, OrcUtils.readSchema is used.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82
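
For illustration only, the costly pattern is roughly the following (a simplified sketch using the Apache ORC reader API directly, not Spark's actual code; basePath is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.TypeDescription;

Configuration conf = new Configuration();
Path dir = new Path(basePath);
FileSystem fs = dir.getFileSystem(conf);

TypeDescription schema = null;
for (FileStatus file : fs.listStatus(dir)) {
  // Each createReader call reads the file's footer and loads its
  // metadata into memory...
  Reader reader = OrcFile.createReader(file.getPath(), OrcFile.readerOptions(conf));
  // ...yet only the first schema found is ever used.
  if (schema == null) {
    schema = reader.getSchema();
  }
}

Inferring the schema from a single file, as in the workaround above, avoids all but one of these Reader creations.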


  was:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L73

When the `basePath` passed to getFileReader is a directory, an OrcFile.Reader is created for every file under the directory, although only the first one with a non-empty schema is returned. This consumes a lot of memory when there are many files under the directory, as each ORC file's metadata is loaded into memory during Reader creation.

I tried the following workaround and the OOM issue went away:

1) Create a Dataset<Row> from a single ORC file.

Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);

2) When creating a Dataset<Row> from all files under the directory, reuse the schema from the previous Dataset.

Dataset<Row> rows = spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);


        Summary: avoid creating OrcFile.Reader for all orc files  (was: OrcFileOperator.getFileReader: avoid creating OrcFile.Reader for all orc files)

> avoid creating OrcFile.Reader for all orc files
> -----------------------------------------------
>
>                 Key: SPARK-25126
>                 URL: https://issues.apache.org/jira/browse/SPARK-25126
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.3.1
>            Reporter: Rao Fu
>            Priority: Minor
>
> We have a Spark job that starts by reading ORC files under an S3 directory, and we noticed that the job consumes a lot of memory when both the number of ORC files and the size of each file are large. The memory bloat went away with the following workaround.
> 1) Create a Dataset<Row> from a single ORC file.
> Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);
> 2) When creating a Dataset<Row> from all files under the directory, reuse the schema from the previous Dataset.
> Dataset<Row> rows = spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
> I believe the issue is that, in order to infer the schema, an OrcFile.Reader is created for every ORC file under the directory even though only the first one is used. Creating a Reader loads the ORC file's metadata into memory, so memory consumption is very high when there are many files under the directory.
> The issue exists in both Spark 2.0 and HEAD.
> In 2.0, OrcFileOperator.readSchema is used.
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95
> In HEAD, OrcUtils.readSchema is used.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org