You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/07 20:59:37 UTC

[GitHub] [hudi] umehrot2 commented on a diff in pull request #5786: [HUDI-2955] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase)

umehrot2 commented on code in PR #5786:
URL: https://github.com/apache/hudi/pull/5786#discussion_r916237629


##########
pom.xml:
##########
@@ -1659,6 +1751,7 @@
         <hudi.spark.common.module>hudi-spark3-common</hudi.spark.common.module>
         <scalatest.version>${scalatest.spark3.version}</scalatest.version>
         <kafka.version>${kafka.spark3.version}</kafka.version>
+        <scalatest.version>3.1.0</scalatest.version>

Review Comment:
   This seems duplicate as it already defined above.



##########
pom.xml:
##########
@@ -1464,9 +1508,19 @@
           <plugin>
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-compiler-plugin</artifactId>
+            <version>${maven-compiler-plugin.version}</version>
             <configuration>
+
               <source>${java.version}</source>
               <target>${java.version}</target>
+              <compilerArgs>
+                <arg>-verbose</arg>

Review Comment:
   I don't think we should check in with `-verbose` to further increases the output logs.



##########
pom.xml:
##########
@@ -380,17 +391,24 @@
           <artifactId>maven-jar-plugin</artifactId>
           <version>${maven-jar-plugin.version}</version>
         </plugin>
+        <plugin>
+          <groupId>org.apache.maven.plugins</groupId>
+          <artifactId>maven-dependency-plugin</artifactId>
+          <version>${maven-dependency-plugin.version}</version>
+        </plugin>
         <plugin>
           <groupId>net.alchim31.maven</groupId>
           <artifactId>scala-maven-plugin</artifactId>
           <version>${scala-maven-plugin.version}</version>
           <configuration>
+            <recompileMode>all</recompileMode>

Review Comment:
   Curious to understand the reason for this ?



##########
pom.xml:
##########
@@ -1464,9 +1508,19 @@
           <plugin>
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-compiler-plugin</artifactId>
+            <version>${maven-compiler-plugin.version}</version>
             <configuration>
+

Review Comment:
   nit: unnecessary line



##########
packaging/hudi-timeline-server-bundle/pom.xml:
##########
@@ -102,6 +102,10 @@
                     <groupId>javax.servlet</groupId>
                     <artifactId>*</artifactId>
                 </exclusion>
+                <exclusion>
+                    <groupId>org.eclipse.jetty</groupId>
+                    <artifactId>*</artifactId>
+                </exclusion>

Review Comment:
   Same here. Why not just exclude this while defining the dependency in the root POM instead of individually excluding it at all places ? This will not be manageable in the long run.



##########
hudi-utilities/pom.xml:
##########
@@ -215,6 +216,14 @@
           <groupId>javax.servlet</groupId>
           <artifactId>*</artifactId>
         </exclusion>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-client-api</artifactId>
+        </exclusion>

Review Comment:
   Any dependency that we are excluding in sub modules, please review whether we can simply exclude it in the root POM itself. It will avoid these random code changes throughout the project.



##########
hudi-utilities/pom.xml:
##########
@@ -39,9 +39,10 @@
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-compiler-plugin</artifactId>
+        <version>${maven-compiler-plugin.version}</version>
         <configuration>
-          <source>1.8</source>
-          <target>1.8</target>
+          <source>${java.version}</source>
+          <target>${java.version}</target>
         </configuration>

Review Comment:
   Does this even need to be redefined here ? Its already defined in root pom.



##########
hudi-utilities/pom.xml:
##########
@@ -233,6 +242,17 @@
       </exclusions>
     </dependency>
 
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-hive_${scala.binary.version}</artifactId>
+      <exclusions>
+        <exclusion>
+          <groupId>*</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>

Review Comment:
   Why is this dependency needed here ?



##########
hudi-spark-datasource/hudi-spark3/pom.xml:
##########
@@ -262,7 +261,7 @@
 
     <dependency>
       <groupId>org.apache.hudi</groupId>
-      <artifactId>hudi-spark3-common</artifactId>
+      <artifactId>${hudi.spark.common.module}</artifactId>

Review Comment:
   Should this not be `hudi.spark3.common.module` ? Otherwise it may use the wrong module when you compile with `spark2` profile.



##########
pom.xml:
##########
@@ -1669,15 +1762,17 @@
         <fasterxml.jackson.dataformat.yaml.version>${fasterxml.spark3.version}
         </fasterxml.jackson.dataformat.yaml.version>
         <skip.hudi-spark2.unit.tests>true</skip.hudi-spark2.unit.tests>
-        <skipITs>true</skipITs>
+        <skipITs>false</skipITs>
       </properties>
       <modules>
         <module>hudi-spark-datasource/hudi-spark3</module>
         <module>hudi-spark-datasource/hudi-spark3-common</module>
       </modules>
       <activation>
+        <activeByDefault>true</activeByDefault>
         <property>
           <name>spark3</name>
+          <value>!disabled</value>

Review Comment:
   Not sure I understand this. Is this still a valid problem or can this comment be marked resolved ?



##########
packaging/hudi-utilities-bundle/pom.xml:
##########
@@ -409,6 +409,12 @@
       <artifactId>hive-service</artifactId>
       <version>${hive.version}</version>
       <scope>${utilities.bundle.hive.scope}</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.eclipse.jetty</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>

Review Comment:
   You have already excluded jetty while defining the dependencies in root POM. You should not have to do that individually in each and every POM file now.



##########
pom.xml:
##########
@@ -110,19 +113,21 @@
     <log4j.test.version>2.17.2</log4j.test.version>
     <slf4j.version>1.7.30</slf4j.version>
     <joda.version>2.9.9</joda.version>
-    <hadoop.version>2.10.1</hadoop.version>
+    <hadoop.version>3.1.0</hadoop.version>

Review Comment:
   I can't remember if its a conscious decision that we already discussed, but whats the reason for not upgrading hadoop to 3.2.1 instead ?



##########
pom.xml:
##########
@@ -78,28 +78,31 @@
 
   <properties>
     <maven-jar-plugin.version>3.2.0</maven-jar-plugin.version>
+    <maven-dependency-plugin.version>3.3.0</maven-dependency-plugin.version>
     <maven-surefire-plugin.version>2.22.2</maven-surefire-plugin.version>
     <maven-failsafe-plugin.version>2.22.2</maven-failsafe-plugin.version>
     <maven-shade-plugin.version>3.2.4</maven-shade-plugin.version>
     <maven-javadoc-plugin.version>3.1.1</maven-javadoc-plugin.version>
-    <maven-compiler-plugin.version>3.8.0</maven-compiler-plugin.version>
+    <maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
     <maven-deploy-plugin.version>2.4</maven-deploy-plugin.version>
     <genjavadoc-plugin.version>0.15</genjavadoc-plugin.version>
     <build-helper-maven-plugin.version>1.7</build-helper-maven-plugin.version>
     <maven-enforcer-plugin.version>3.0.0-M1</maven-enforcer-plugin.version>
     <maven-docker-plugin.version>0.37.0</maven-docker-plugin.version>
 
     <java.version>1.8</java.version>
-    <fasterxml.version>2.6.7</fasterxml.version>
-    <fasterxml.jackson.databind.version>2.6.7.3</fasterxml.jackson.databind.version>
-    <fasterxml.jackson.module.scala.version>2.6.7.1</fasterxml.jackson.module.scala.version>
-    <fasterxml.jackson.dataformat.yaml.version>2.7.4</fasterxml.jackson.dataformat.yaml.version>
+    <fasterxml.version>${fasterxml.spark3.version}</fasterxml.version>
+    <fasterxml.jackson.databind.version>${fasterxml.spark3.version}</fasterxml.jackson.databind.version>
+    <fasterxml.jackson.module.scala.version>${fasterxml.spark3.version}</fasterxml.jackson.module.scala.version>
+    <fasterxml.jackson.dataformat.yaml.version>${fasterxml.spark3.version}</fasterxml.jackson.dataformat.yaml.version>
     <fasterxml.spark3.version>2.10.0</fasterxml.spark3.version>
-    <kafka.version>2.0.0</kafka.version>
-    <kafka.spark3.version>2.4.1</kafka.spark3.version>
+    <kafka.version>${kafka.spark3.version}</kafka.version>
+    <kafka.spark2.version>2.0.0</kafka.spark2.version>
+    <kafka.spark3.version>2.8.0</kafka.spark3.version>
     <pulsar.version>2.8.1</pulsar.version>
     <confluent.version>5.3.4</confluent.version>
     <glassfish.version>2.17</glassfish.version>
+    <parquet.version>1.12.1</parquet.version>

Review Comment:
   This should be 1.12.2 right ? Atleast thats what I see being used on the master already for Spark 3.



##########
packaging/hudi-integ-test-bundle/pom.xml:
##########
@@ -482,6 +482,12 @@
       <artifactId>hadoop-hdfs</artifactId>
       <classifier>tests</classifier>
       <scope>test</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.eclipse.jetty</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>

Review Comment:
   Okay, so a general high level comment here. Throughout the PR there are jetty dependency exclusions added which is creating a lot of unnecessary diff. Ideally this should all just be managed through the root POMs `dependencyManagement` instead of excluding it everywhere. We should clean up these diffs from the PR.



##########
packaging/hudi-hadoop-mr-bundle/pom.xml:
##########
@@ -29,6 +29,8 @@
   <properties>
     <checkstyle.skip>true</checkstyle.skip>
     <main.basedir>${project.parent.basedir}</main.basedir>
+    <!-- override parquet version to be same as Hive 3.1.2 -->
+    <parquet.version>1.10.1</parquet.version>

Review Comment:
   Can we instead add like `hive.parquet.version` in the root POM and use that for the parquet dependencies here ? Changing the version in the internal POMs like this becomes hard to track. If we atleast have a property defined around this, it will be easier to remember in future that we are dealing with multiple parquet versions in the project.



##########
hudi-spark-datasource/hudi-spark/pom.xml:
##########
@@ -202,6 +202,12 @@
       <groupId>org.apache.hudi</groupId>
       <artifactId>hudi-common</artifactId>
       <version>${project.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>org.apache.hive</groupId>
+          <artifactId>hive-storage-api</artifactId>
+        </exclusion>
+      </exclusions>

Review Comment:
   I don't think you need to do this exclusion. You are already explicitly defining this dependency in this POM fixed to a specific version.



##########
hudi-flink-datasource/hudi-flink/pom.xml:
##########
@@ -45,8 +45,8 @@
                 <groupId>org.apache.maven.plugins</groupId>
                 <artifactId>maven-compiler-plugin</artifactId>
                 <configuration>
-                    <source>1.8</source>
-                    <target>1.8</target>
+                    <source>${java.version}</source>
+                    <target>${java.version}</target>

Review Comment:
   Can we avoid these changes ? Just having it in root pom should be sufficient.



##########
hudi-timeline-service/pom.xml:
##########
@@ -117,6 +123,28 @@
       <artifactId>rocksdbjni</artifactId>
     </dependency>
 
+    <!-- Jetty -->
+    <dependency>
+      <groupId>org.eclipse.jetty</groupId>
+      <artifactId>jetty-server</artifactId>
+      <version>${jetty.version}</version>
+    </dependency>

Review Comment:
   How was this module getting the jetty dependencies earlier ? Was it happening transitively ?
   
   Also no need to mention `jetty.version` while defining again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org