Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/06 02:29:02 UTC

[GitHub] [hudi] nsivabalan commented on a diff in pull request #6550: [HUDI-4691] Cleaning up duplicated classes in Spark 3.3 module

nsivabalan commented on code in PR #6550:
URL: https://github.com/apache/hudi/pull/6550#discussion_r963185123


##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##########
@@ -79,10 +79,6 @@ object HoodieAnalysis {
       val spark3Analysis: RuleBuilder =
         session => ReflectionUtils.loadClass(spark3AnalysisClass, session).asInstanceOf[Rule[LogicalPlan]]
 
-      val spark3ResolveReferencesClass = "org.apache.spark.sql.hudi.analysis.HoodieSpark3ResolveReferences"

Review Comment:
   Maybe not related to the changes in this patch, but "HoodieSpark3Analysis" actually refers to Spark 3.2 or greater, right? If so, the naming is misleading: I don't see it being used for 3.1 below.
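   
   For context, here's a minimal sketch of the reflective, version-gated rule loading this block performs; the `gteqSpark3_2` guard and the wrapper object are my assumptions about the surrounding code, not a quote from it:
   
   ```scala
   import org.apache.hudi.HoodieSparkUtils
   import org.apache.hudi.common.util.ReflectionUtils
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   import org.apache.spark.sql.catalyst.rules.Rule
   
   object AnalysisRulesSketch {
     type RuleBuilder = SparkSession => Rule[LogicalPlan]
   
     def customResolutionRules: Seq[RuleBuilder] =
       // The rule class is loaded reflectively so this module compiles without a
       // hard dependency on any single Spark 3.x line; the guard is what actually
       // scopes the rule, and it guards 3.2+, not all of Spark 3 (hence the naming nit).
       if (HoodieSparkUtils.gteqSpark3_2) {
         val spark3AnalysisClass = "org.apache.spark.sql.hudi.analysis.HoodieSpark3Analysis"
         val spark3Analysis: RuleBuilder =
           session => ReflectionUtils.loadClass(spark3AnalysisClass, session).asInstanceOf[Rule[LogicalPlan]]
         Seq(spark3Analysis)
       } else Seq.empty
   }
   ```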



##########
hudi-utilities/pom.xml:
##########
@@ -139,14 +125,22 @@
         </exclusion>
       </exclusions>
     </dependency>
+
+    <!-- Hoodie - Other -->
+
+    <!-- NOTE: This dependency is kept around even though it's not necessary, due to the fact
+               that removing it messes up Orc dependencies on the class-path failing to compile
+               test for Spark 3.x (this occurs b/c Hudi depends on "nohive" version of "orc-core"
+               while Spark 3.x switches to a regular one) -->
     <dependency>
       <groupId>org.apache.hudi</groupId>
-      <artifactId>hudi-spark-common_${scala.binary.version}</artifactId>
+      <artifactId>hudi-common</artifactId>

Review Comment:
   hudi-spark pulls in hudi-common transitively as well, right? Do we need the explicit dependency here?



##########
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieCatalog.scala:
##########
@@ -346,20 +346,25 @@ class HoodieCatalog extends DelegatingCatalogExtension
 }
 
 object HoodieCatalog {
-  def convertTransforms(partitions: Seq[Transform]): (Seq[String], Option[BucketSpec]) = {
+  def convertTransforms(transforms: Seq[Transform]): (Seq[String], Option[BucketSpec]) = {
     val identityCols = new mutable.ArrayBuffer[String]
     var bucketSpec = Option.empty[BucketSpec]
 
-    partitions.map {
+    transforms.foreach {
       case IdentityTransform(FieldReference(Seq(col))) =>
         identityCols += col
 
+      case MatchBucketTransform(numBuckets, col, sortCol) =>

Review Comment:
   Is this a bug fix? It doesn't look like pure refactoring.
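   
   Concretely, a pattern-matching anonymous function is partial, so any transform not covered by a case fails at runtime with a scala.MatchError. A self-contained toy example (the types below are stand-ins, not Hudi's or Spark's):
   
   ```scala
   object PartialMatchDemo extends App {
     sealed trait Transform
     case class Identity(col: String) extends Transform
     case class Bucket(numBuckets: Int, col: String) extends Transform
   
     val transforms: Seq[Transform] = Seq(Identity("dt"), Bucket(8, "uuid"))
   
     transforms.foreach {
       case Identity(col) =>
         println(s"partition column: $col")
       // Without this case, Bucket(8, "uuid") would throw scala.MatchError,
       // so adding it changes behavior for bucketed tables rather than
       // merely restructuring the code.
       case Bucket(n, col) =>
         println(s"bucket $col into $n buckets")
     }
   }
   ```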



##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala:
##########
@@ -79,10 +79,6 @@ object HoodieAnalysis {
       val spark3Analysis: RuleBuilder =
         session => ReflectionUtils.loadClass(spark3AnalysisClass, session).asInstanceOf[Rule[LogicalPlan]]
 
-      val spark3ResolveReferencesClass = "org.apache.spark.sql.hudi.analysis.HoodieSpark3ResolveReferences"

Review Comment:
   Maybe you can rename the variables on L78 and L79 accordingly.



##########
pom.xml:
##########
@@ -1938,7 +1910,8 @@
         <scala.version>${scala12.version}</scala.version>
         <scala.binary.version>2.12</scala.binary.version>
         <hudi.spark.module>hudi-spark3.2.x</hudi.spark.module>
-        <hudi.spark.common.module>hudi-spark3-common</hudi.spark.common.module>
+        <!-- This glob has to include hudi-spark3-common, hudi-spark3.2plus-common -->
+        <hudi.spark.common.modules.glob>hudi-spark3*-common</hudi.spark.common.modules.glob>

Review Comment:
   +1, I'd prefer to list the modules explicitly.



##########
hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/hudi/catalog/HoodieStagedTable.scala:
##########
@@ -29,7 +30,7 @@ import org.apache.spark.sql.types.StructType
 
 import java.net.URI
 import java.util
-import scala.collection.JavaConverters.{mapAsScalaMapConverter, setAsJavaSetConverter}
+import scala.jdk.CollectionConverters.{mapAsScalaMapConverter, setAsJavaSetConverter}

Review Comment:
   Was this change intentional? I see we widely use `scala.collection.JavaConverters` elsewhere. If you'd prefer to replace all usages, can we file a tracking JIRA and do it separately?
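   
   For reference, the 2.12-style decorator imports the codebase uses elsewhere look like this (a minimal, runnable sketch):
   
   ```scala
   import java.util.{HashMap => JHashMap}
   
   // Scala 2.12 style: named implicit decorators from JavaConverters.
   import scala.collection.JavaConverters.mapAsScalaMapConverter
   
   object ConvertersDemo extends App {
     val javaProps = new JHashMap[String, String]()
     javaProps.put("hoodie.table.name", "trips")
   
     // .asScala is supplied by the imported implicit decorator.
     val scalaProps: Map[String, String] = javaProps.asScala.toMap
     println(scalaProps)
   }
   ```
   
   Note that Scala 2.13's `scala.jdk.CollectionConverters` exposes differently named implicits (e.g. `MapHasAsScala`), so the swap is not a purely mechanical rename; one more reason to track it separately.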



##########
hudi-spark-datasource/hudi-spark3.2plus-common/pom.xml:
##########
@@ -0,0 +1,234 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <parent>
+        <artifactId>hudi-spark-datasource</artifactId>
+        <groupId>org.apache.hudi</groupId>
+        <version>0.13.0-SNAPSHOT</version>
+    </parent>
+    <modelVersion>4.0.0</modelVersion>
+
+    <artifactId>hudi-spark3.2plus-common</artifactId>

Review Comment:
   Yes, and the other option doesn't work smoothly either. For example, we could name this spark3.2_3.3, but what if Spark 3.4 turns out to be compatible too? Either way, we'll have to rename some of these once there are more patch or minor versions. We can probably go with what we have in this patch.



##########
hudi-spark-datasource/hudi-spark3.1.x/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:
##########
@@ -16,4 +16,4 @@
 # limitations under the License.
 
 
-org.apache.hudi.Spark3xDefaultSource
\ No newline at end of file
+org.apache.hudi.Spark31DefaultSource

Review Comment:
   Should we keep it as Spark31xDefaultSource?
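   
   Whatever name we pick, it has to match what Spark's ServiceLoader finds in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister. A minimal sketch of that contract (the class body is a stand-in, not the real Spark31DefaultSource):
   
   ```scala
   import org.apache.spark.sql.sources.DataSourceRegister
   
   // java.util.ServiceLoader instantiates whatever class the service file names,
   // then Spark matches shortName() against spark.read.format("hudi").
   class Spark31DefaultSource extends DataSourceRegister {
     override def shortName(): String = "hudi"
   }
   ```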



##########
pom.xml:
##########
@@ -377,9 +377,17 @@
                     <exclude>org.sl4fj:slf4j-jcl</exclude>
                     <exclude>log4j:log4j</exclude>
                     <exclude>ch.qos.logback:logback-classic</exclude>
+                    <!-- NOTE: We're banning any HBase deps versions other than the approved ${hbase.version},
+                               which is aimed at preventing the classpath collisions w/ transitive deps usually) -->
+                    <exclude>org.apache.hbase:hbase-common:*</exclude>

Review Comment:
   @yihua: can you review the HBase-related changes? They should mostly be good; I just wanted to double-confirm.



##########
hudi-spark-datasource/hudi-spark3.1.x/src/main/scala/org/apache/spark/sql/avro/HoodieSpark3_1AvroDeserializer.scala:
##########
@@ -18,12 +18,19 @@
 package org.apache.spark.sql.avro
 
 import org.apache.avro.Schema
+import org.apache.spark.sql.catalyst.NoopFilters
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy
 import org.apache.spark.sql.types.DataType
 
 class HoodieSpark3_1AvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType)
   extends HoodieAvroDeserializer {
 
-  private val avroDeserializer = new AvroDeserializer(rootAvroType, rootCatalystType)
+  private val avroDeserializer = {
+    val avroRebaseModeInRead = LegacyBehaviorPolicy.withName(SQLConf.get.getConf(SQLConf.LEGACY_AVRO_REBASE_MODE_IN_READ))

Review Comment:
   As per master, I see this conf is of interest only for Spark 3.2 and above. But now we are also adding it for Spark 3.1. Is that intentional?
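   
   For reference, this is how I'd expect the rest of the truncated block to read, assuming Spark 3.1's four-argument AvroDeserializer constructor (Schema, DataType, LegacyBehaviorPolicy.Value, StructFilters); a sketch, not a quote from the patch:
   
   ```scala
   private val avroDeserializer = {
     // Read the session-level rebase policy that Spark's own Avro read path consults.
     val avroRebaseModeInRead = LegacyBehaviorPolicy.withName(
       SQLConf.get.getConf(SQLConf.LEGACY_AVRO_REBASE_MODE_IN_READ))
     // NoopFilters: no filter push-down during deserialization.
     new AvroDeserializer(rootAvroType, rootCatalystType, avroRebaseModeInRead, new NoopFilters)
   }
   ```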



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org