Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/19 06:11:53 UTC

[GitHub] [spark] sunchao commented on pull request #38277: [SPARK-40815][SQL] Disable "spark.hadoopRDD.ignoreEmptySplits" in order to fix the correctness issue when using Hive SymlinkTextInputFormat

sunchao commented on PR #38277:
URL: https://github.com/apache/spark/pull/38277#issuecomment-1283482120

   Is it possible to treat `SymlinkTextInputFormat` specially in `NewHadoopRDD` and `HadoopRDD`, where `spark.hadoopRDD.ignoreEmptySplits` is applied? For example, something like:
   
   ```scala
   val allRowSplits = inputFormat.getSplits(new JobContextImpl(_conf, jobId)).asScala
   // Keep zero-length splits for SymlinkTextInputFormat: its splits are links
   // to lists of paths, so an "empty" split can still reference real data.
   val rawSplits = if (ignoreEmptySplits &&
       inputFormat.getClass.getName != "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat") {
     allRowSplits.filter(_.getLength > 0)
   } else {
     allRowSplits
   }
   ```
   
   I think a fix in Spark itself would be a good short-term solution. A fix in Hive appears to be more involved: for one, it's hard to give a reasonable start and length for `SymlinkTextInputSplit`, since it is just a link to a list of paths, and I'm not sure whether changing the class would affect other places within Hive (the class has been there for a very long time).
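
   If we'd rather not inline the class-name check in both RDDs, the guard could be factored into a small shared helper. The object and method names below are hypothetical (invented for illustration, not existing Spark API), and matching on the class name as a string avoids a compile-time dependency on hive-exec:

   ```scala
   // Hypothetical helper: decides whether it is safe to drop zero-length
   // splits for a given InputFormat implementation. SymlinkTextInputFormat
   // reports zero-length splits that still point at real data via link files,
   // so filtering them away silently loses rows.
   object SplitFiltering {
     // Compare by class name to avoid a compile-time dependency on Hive.
     private val formatsWithMeaningfulEmptySplits = Set(
       "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat")

     def canIgnoreEmptySplits(inputFormatClassName: String): Boolean =
       !formatsWithMeaningfulEmptySplits.contains(inputFormatClassName)
   }
   ```

   Both `HadoopRDD` and `NewHadoopRDD` could then guard the filter with `ignoreEmptySplits && SplitFiltering.canIgnoreEmptySplits(inputFormat.getClass.getName)`.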
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

