Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/13 15:55:02 UTC

[GitHub] gengliangwang opened a new pull request #23774: [SPARK-26871][SQL]File Source V2: avoid creating unnecessary FileIndex in the write path

URL: https://github.com/apache/spark/pull/23774
 
 
   ## What changes were proposed in this pull request?
   
   In https://github.com/apache/spark/pull/23383, the file source V2 framework was implemented. In that PR, `FileIndex` is created as a member of `FileTable`, so that we can implement partition pruning like https://github.com/apache/spark/commit/0f9fcabb4ac2e8afec14d010e86467372a85d334 in the future. (Partition pruning was removed from that PR because the data source V2 catalog is still under development.)
   
   However, after the write path of file source V2 was implemented, I found that a simple write creates an unnecessary `FileIndex`, because `FileTable` requires one. This is a regression of sorts, and it also produces a warning message when writing to ORC files:
   ``` 
   WARN InMemoryFileIndex: The directory file:/tmp/foo was not found. Was it deleted very recently?
   ```
   This PR makes `FileIndex` a lazy value in `FileTable`, so that we avoid creating an unnecessary `FileIndex` in the write path.
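   
   To illustrate the idea, here is a minimal sketch of how a lazy value defers construction until first access. `SketchFileIndex`, `SketchFileTable`, `writeTarget`, `plannedInputPaths`, and `indexBuilds` are hypothetical stand-ins invented for this example; they are not the real Spark classes or the actual code in this patch.
   
   ```scala
   // Hypothetical stand-ins for illustration only; these are not Spark's real classes.
   object LazyFileIndexSketch {
     // Counts how many times the (potentially expensive) index is constructed.
     var indexBuilds: Int = 0
   
     // Stand-in for a FileIndex: constructing it represents the costly file listing.
     final class SketchFileIndex(val paths: Seq[String]) {
       indexBuilds += 1
     }
   
     // Stand-in for FileTable: the index is a lazy val, so it is built on first access only.
     final class SketchFileTable(paths: Seq[String]) {
       lazy val fileIndex: SketchFileIndex = new SketchFileIndex(paths)
   
       // Read path: needs the index to know which files exist.
       def plannedInputPaths: Seq[String] = fileIndex.paths
   
       // Write path: only needs the target path, so it never touches fileIndex.
       def writeTarget: String = paths.head
     }
   
     def main(args: Array[String]): Unit = {
       val table = new SketchFileTable(Seq("file:/tmp/foo"))
       table.writeTarget                 // does not construct the index
       println(s"index builds after write: $indexBuilds")
       table.plannedInputPaths           // first access constructs the index
       table.plannedInputPaths           // later accesses reuse the same instance
       println(s"index builds after reads: $indexBuilds")
     }
   }
   ```
   
   In this sketch the write-only call never triggers index construction, which is the behavior this PR aims for in `FileTable`; the real change may differ in detail.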
   
   ## How was this patch tested?
   
   Existing unit tests.
   
