Posted to reviews@spark.apache.org by "attilapiros (via GitHub)" <gi...@apache.org> on 2023/07/24 16:38:23 UTC

[GitHub] [spark] attilapiros commented on pull request #41628: [SPARK-38230][SQL] InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions

attilapiros commented on PR #41628:
URL: https://github.com/apache/spark/pull/41628#issuecomment-1648248050

   @jeanlyn data consistency is more important than performance, even if the chance of violating it is extremely rare.
   
   I have an idea: what about introducing `getPartitionsWithCustomLocation` into the HiveShim layer? The implementation can be reflection based (of course Hive should be extended with this new method too) and could fall back to fetching all the partitions and doing the filtering (default location != actual location) on the Spark side, so that earlier Hive versions keep working. This way the old insert-into code path would be sped up, since partitions with custom locations are rare (it would also relieve the memory pressure, and the code would stay simple).
   
   For the reflection and fallback pattern, please check the existing code, e.g.:
   https://github.com/apache/spark/blob/1b430ef2c9a68c3d7e09727eaa8c233ade13b4df/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1131
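   
   For the new shim method I am thinking of something like this (just a rough, untested sketch: `getPartitionsWithCustomLocation` on the Hive side is hypothetical and would first have to be added to Hive, and the "is this the default location?" check below is deliberately simplified):
   
   ```scala
   import java.util.{List => JList}
   import scala.collection.JavaConverters._
   import scala.util.Try
   
   import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}
   
   // Hypothetical holder object, just for illustration of the shim method.
   object PartitionsWithCustomLocationSketch {
   
     // Look up the (hypothetical) new Hive method once via reflection.
     // None means we are running against an older Hive that does not have it.
     private lazy val getPartitionsWithCustomLocationMethod =
       Try(classOf[Hive].getMethod("getPartitionsWithCustomLocation", classOf[Table])).toOption
   
     def getPartitionsWithCustomLocation(hive: Hive, table: Table): Seq[Partition] = {
       getPartitionsWithCustomLocationMethod match {
         case Some(method) =>
           // Newer Hive: the metastore returns only the partitions whose
           // location differs from the default one.
           method.invoke(hive, table).asInstanceOf[JList[Partition]].asScala.toSeq
         case None =>
           // Fallback for earlier Hive versions: fetch all partitions and keep
           // only the ones that do not live under the table's default location
           // (simplified check; a real implementation should compute the default
           // partition path from the partition spec and compare locations exactly).
           val defaultLocation = table.getDataLocation.toString
           hive.getAllPartitionsOf(table).asScala.toSeq.filterNot { p =>
             Option(p.getDataLocation).exists(_.toString.startsWith(defaultLocation))
           }
       }
     }
   }
   ```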
   
   What do you think?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

