Posted to issues@spark.apache.org by "Omer Ozarslan (Jira)" <ji...@apache.org> on 2021/02/11 15:55:00 UTC

[jira] [Created] (SPARK-34423) Allow FileTable.fileIndex to be reused for custom partition schema in DataSourceV2 read path

Omer Ozarslan created SPARK-34423:
-------------------------------------

             Summary: Allow FileTable.fileIndex to be reused for custom partition schema in DataSourceV2 read path
                 Key: SPARK-34423
                 URL: https://issues.apache.org/jira/browse/SPARK-34423
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.1
            Reporter: Omer Ozarslan


It is currently possible to provide a custom partition schema in the DataSourceV2 read path with custom implementations of PartitioningAwareFileIndex/PartitionSpec, by overriding fileIndex in a subclass of FileTable. However, since fileIndex is a lazy val, it is not possible to reuse it from the subclass (i.e. via super.fileIndex), because Scala does not allow super access to a val.

[https://github.com/apache/spark/blob/e0053853c90d39ef6de9d59fb933525e20bae1fa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala#L44-L61]
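To make the limitation concrete, the subclass currently ends up looking roughly like this (CustomSchemaTable is a hypothetical name, and the body is a sketch rather than working code):
{code:java}
class CustomSchemaTable(...) extends FileTable(...) {
  // `super.fileIndex` is not an option here: Scala forbids super access
  // to a (lazy) val, so the body of FileTable.fileIndex has to be copied
  // verbatim, including its call into DataSource's private globbing helper.
  override lazy val fileIndex: PartitioningAwareFileIndex = {
    ...[copy of FileTable.fileIndex logic, adjusted to build the custom PartitionSpec]...
  }
}{code}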

Duplicating this code in the subclass is possible but somewhat hacky, e.g. DataSource's globbing helper (DataSource.checkAndGlobPathIfNecessary) is private API. I was wondering if this logic could be refactored into something like this:
{code:java}
def createFileIndex(): PartitioningAwareFileIndex = {
  ...[current fileIndex logic]...
}

lazy val fileIndex: PartitioningAwareFileIndex = createFileIndex(){code}
This would allow downstream code to reuse the fileIndex logic by wrapping it with custom implementations, as sketched below.
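For instance, a downstream table could then wrap the base index instead of copying its construction. (CustomSchemaTable, CustomPartitionFileIndex and myCustomSpec are hypothetical names; only FileTable, fileIndex and the proposed createFileIndex come from this ticket.)
{code:java}
// Hypothetical PartitioningAwareFileIndex that delegates file listing to
// the index built by createFileIndex() but reports a custom PartitionSpec.
class CustomPartitionFileIndex(
    delegate: PartitioningAwareFileIndex,
    customSpec: PartitionSpec) extends PartitioningAwareFileIndex(...) {
  override def partitionSpec(): PartitionSpec = customSpec
  ...[remaining members forwarded to delegate]...
}

class CustomSchemaTable(...) extends FileTable(...) {
  // Reuse the base logic instead of duplicating it.
  override lazy val fileIndex: PartitioningAwareFileIndex =
    new CustomPartitionFileIndex(createFileIndex(), myCustomSpec)
}{code}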

(Note that this proposed change considers a custom partition schema in the read path only; the write path is out of the scope of this change.)


