Posted to reviews@spark.apache.org by "guixiaowen (via GitHub)" <gi...@apache.org> on 2023/12/13 11:09:34 UTC

Re: [PR] [SPARK-45894][SQL] hive table level setting hadoop.mapred.max.split.size [spark]

guixiaowen commented on PR #43768:
URL: https://github.com/apache/spark/pull/43768#issuecomment-1853714954

   > Thank you for making a PR, @guixiaowen . I have a few comments.
   > 
   > 1. Apache Spark 3.2.0 has reached its end of life as of now, so this improvement should use version 4.0.0.
   > 
   > ```
   > - .version("3.2.0")
   > + .version("4.0.0")
   > ```
   > 
   > 2. Given that this is a Hive-table-specific configuration and will not change during the lifetime of Spark jobs, please move the config to `HiveUtils.scala` like the following.
   > 
   > https://github.com/apache/spark/blob/9987cab84bb05a61a5c0b43f94a733561a2e074a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L157
   > 
   > 3. The config name, `spark.hadoop.mapred.`, looks a little weird because `org.apache.hadoop.mapred` is the older API and `org.apache.hadoop.mapreduce` is the new one. In addition, this PR also uses `mapreduce.input.*` in the code.
   > 4. We are careful about adding new namespaces when adding a new config. For example, `spark.hadoop.mapred.max.split.size.by.table` creates multiple namespaces like the following. If there is no other sibling in the same namespace, we should not use `.`.
   > 
   > ```
   > spark.hadoop.mapred.*
   > spark.hadoop.mapred.max.*
   > spark.hadoop.mapred.max.split.*
   > spark.hadoop.mapred.max.split.size.*
   > spark.hadoop.mapred.max.split.size.by.*
   > ```
   > 
   > 5. I'm wondering if this is a general improvement. Specifically, would a Hive table property be helpful in your case or not? For example, IIUC, Spark understands the `TBLPROPERTIES ("orc.compress"="ZLIB")` style. Could you put your `mapreduce.input.fileinputformat.split.maxsize` into a Hive table property in the same way instead of adding a new Spark configuration?
   > 6. We need test coverage to protect your PR from possible future regressions.
   
   Thank you very much for the review.
   I will revise the PR according to the points you mentioned.
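
   For point 2, here is a rough sketch of how I understand the suggestion (the key name below is only a placeholder; the final name, type, and default are still open, also per points 3 and 4):

   ```scala
   // In sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala.
   // Relies on buildStaticConf already being in scope there via
   // org.apache.spark.sql.internal.SQLConf; the key name is a placeholder.
   val HIVE_TABLE_MAX_SPLIT_BYTES = buildStaticConf("spark.sql.hive.tableMaxSplitBytes")
     .doc("Maximum split size, in bytes, used when reading the files of a Hive table. " +
       "Intended to be applied as mapreduce.input.fileinputformat.split.maxsize on the Hive read path.")
     .version("4.0.0")
     .longConf
     .createOptional
   ```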

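   For point 5, the table-property route would look roughly like the sketch below (the table name and the 128 MB value are only examples); whether Spark's Hive read path actually honors this property per table is exactly what I will verify before updating the PR:

   ```scala
   // Sketch of the TBLPROPERTIES approach from point 5. The table name and value
   // are examples only; per-table honoring of this property still has to be confirmed.
   spark.sql("""
     ALTER TABLE db.example_table
     SET TBLPROPERTIES ('mapreduce.input.fileinputformat.split.maxsize' = '134217728')
   """)
   ```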

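   For point 6, a bare skeleton of the kind of test I have in mind (class name, placement, and the final assertion are placeholders and depend on how the config or table property ends up being wired):

   ```scala
   import org.apache.spark.sql.{QueryTest, Row}
   import org.apache.spark.sql.hive.test.TestHiveSingleton
   import org.apache.spark.sql.test.SQLTestUtils

   // Skeleton only: would sit next to the existing Hive suites under sql/hive/src/test.
   class HiveTableSplitSizeSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {

     test("table-level max split size is applied on the Hive read path") {
       withTable("t") {
         sql("CREATE TABLE t (id INT) STORED AS TEXTFILE")
         sql("INSERT INTO t VALUES (1), (2), (3)")
         // A real assertion would check the number of input splits/partitions once the
         // final mechanism is settled; correctness of the result is the minimum bar.
         checkAnswer(sql("SELECT id FROM t ORDER BY id"), Row(1) :: Row(2) :: Row(3) :: Nil)
       }
     }
   }
   ```
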
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

