Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/07 21:59:52 UTC

[GitHub] [hudi] yihua commented on a diff in pull request #5786: [HUDI-2955] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase)

yihua commented on code in PR #5786:
URL: https://github.com/apache/hudi/pull/5786#discussion_r916291602


##########
docker/compose/hadoop.env:
##########
@@ -21,6 +21,15 @@ HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive
 HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive
 HIVE_SITE_CONF_datanucleus_autoCreateSchema=false
 HIVE_SITE_CONF_hive_metastore_uris=thrift://hivemetastore:9083
+HIVE_SITE_CONF_hive_metastore_uri_resolver=org.apache.hudi.hadoop.hive.NoOpMetastoreUriResolverHook
+HIVE_SITE_CONF_hive_metastore_event_db_notification_api_auth=false
+HIVE_SITE_CONF_hive_execution_engine=mr

Review Comment:
   MapReduce is no longer supported in Hive 3.  Should this be configured with `tez`?
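
   If `tez` is the intended engine here, the switch would be a one-line edit in the same env file. A sketch, assuming the demo images actually bundle Tez:

   ```
   HIVE_SITE_CONF_hive_execution_engine=tez
   ```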



##########
azure-pipelines.yml:
##########
@@ -200,27 +223,22 @@ stages:
               mavenOptions: '-Xmx4g'
       - job: IT
         displayName: IT modules
-        timeoutInMinutes: '120'
+        timeoutInMinutes: '180'
         steps:
           - task: Maven@3
             displayName: maven install
+            continueOnError: true
+            retryCountOnTaskFailure: 2
             inputs:
               mavenPomFile: 'pom.xml'
               goals: 'clean install'
               options: $(MVN_OPTS_INSTALL) -Pintegration-tests
               publishJUnitResults: false
               jdkVersionOption: '1.8'
-          - task: Maven@3

Review Comment:
   Is there still a problem running the unit tests specified here?



##########
hudi-examples/hudi-examples-flink/src/test/java/org/apache/hudi/examples/quickstart/TestHoodieFlinkQuickstart.java:
##########
@@ -45,6 +46,7 @@ void beforeEach() {
   @TempDir
   File tempFile;
 
+  @Disabled

Review Comment:
   Could you add the Jira ticket number for all the disabled tests?
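
   For illustration, a tagged disabled test might look like the sketch below; `HUDI-XXXX` is a placeholder, not a real ticket:

   ```java
   import org.junit.jupiter.api.Disabled;
   import org.junit.jupiter.api.Test;

   class ExampleTest {
     // HUDI-XXXX is a placeholder; link the Jira issue tracking the re-enable
     @Disabled("HUDI-XXXX: disabled during the Hadoop 3 / Hive 3 / Spark 3.2 upgrade")
     @Test
     void quickstart() {
       // test body elided
     }
   }
   ```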



##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java:
##########
@@ -189,7 +190,13 @@ public static Writable avroToArrayWritable(Object value, Schema schema) {
         Writable[] recordValues = new Writable[schema.getFields().size()];
         int recordValueIndex = 0;
         for (Schema.Field field : schema.getFields()) {
-          recordValues[recordValueIndex++] = avroToArrayWritable(record.get(field.name()), field.schema());
+          Object fieldValue = null;
+          try {
+            fieldValue = record.get(field.name());
+          } catch (AvroRuntimeException e) {
+            LOG.debug("Field: " + field.name() + " not found in schema: " + schema.toString());

Review Comment:
   Should this throw an exception instead of debug logging?
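
   If failing fast is preferred, the lookup could wrap and rethrow. A minimal sketch, assuming `org.apache.hudi.exception.HoodieException` is the appropriate wrapper here:

   ```java
   import org.apache.avro.AvroRuntimeException;
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.hudi.exception.HoodieException;

   // fail-fast variant of the field lookup: surface the schema mismatch
   // instead of logging at debug level and writing a null value
   static Object getFieldOrThrow(GenericRecord record, Schema.Field field, Schema schema) {
     try {
       return record.get(field.name());
     } catch (AvroRuntimeException e) {
       throw new HoodieException("Field " + field.name() + " not found in schema " + schema, e);
     }
   }
   ```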



##########
hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java:
##########
@@ -338,8 +348,8 @@ private void saveUpLogs() {
           executeCommandStringInDocker(HIVESERVER, "cat /tmp/root/hive.log |  grep -i exception -A 10 -B 5", false).getStdout().toString();
       String filePath = System.getProperty("java.io.tmpdir") + "/" + System.currentTimeMillis() + "-hive.log";
       FileIOUtils.writeStringToFile(hiveLogStr, filePath);
-      LOG.info("Hive log saved up at  : " + filePath);
-      LOG.info("<===========  Full hive log ===============>\n"
+      LOG.error("Hive log saved up at  : " + filePath);

Review Comment:
   Similar here for the clean-up.



##########
hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java:
##########
@@ -236,8 +243,11 @@ private TestExecStartResultCallback executeCommandInDocker(String containerName,
     int exitCode = dockerClient.inspectExecCmd(createCmdResponse.getId()).exec().getExitCode();
     LOG.info("Exit code for command : " + exitCode);
     if (exitCode != 0) {
-      LOG.error("\n\n ###### Stdout #######\n" + callback.getStdout().toString());
+      //LOG.error("\n\n ###### Stdout #######\n" + callback.getStdout().toString());
     }
+    callback.getStderr().flush();
+    callback.getStdout().flush();
+    LOG.error("\n\n ###### Stdout #######\n" + callback.getStdout().toString());

Review Comment:
   A reminder to remove these changes before merging.



##########
hudi-spark-datasource/hudi-spark3-common/pom.xml:
##########
@@ -166,7 +166,7 @@
 
         <dependency>
             <groupId>org.apache.spark</groupId>
-            <artifactId>spark-sql_2.12</artifactId>
+            <artifactId>spark-sql_${spark3.scala.binary.version}</artifactId>

Review Comment:
   Is this necessary since `spark3.scala.binary.version` should always be `2.12`?



##########
hudi-spark-datasource/hudi-spark3.1.x/pom.xml:
##########
@@ -24,7 +24,7 @@
   <artifactId>hudi-spark3.1.x_2.12</artifactId>
   <version>0.12.0-SNAPSHOT</version>
 
-  <name>hudi-spark3.1.x_2.12</name>
+  <name>hudi-spark3.1.x_${spark3.scala.binary.version}</name>

Review Comment:
   Same here



##########
azure-pipelines.yml:
##########
@@ -200,27 +223,22 @@ stages:
               mavenOptions: '-Xmx4g'
       - job: IT
         displayName: IT modules
-        timeoutInMinutes: '120'
+        timeoutInMinutes: '180'

Review Comment:
   Do you know why the Spark 3 tests take longer? Or is this change unnecessary?



##########
hudi-spark-datasource/hudi-spark3.1.x/pom.xml:
##########
@@ -202,6 +202,18 @@
       </exclusions>
     </dependency>
 
+    <dependency>
+      <groupId>org.apache.hudi</groupId>
+      <artifactId>${hudi.spark.common.module}</artifactId>

Review Comment:
   Is this the same dependency as below (`hudi-spark3-common`)?



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSourceStorage.scala:
##########
@@ -61,10 +61,8 @@ class TestCOWDataSourceStorage extends SparkClientFunctionalTestHarness {
   @ParameterizedTest
   @CsvSource(value = Array(
     "true|org.apache.hudi.keygen.SimpleKeyGenerator|_row_key",
-    "true|org.apache.hudi.keygen.ComplexKeyGenerator|_row_key,nation.bytes",

Review Comment:
   #5868 should solve the problem here.



##########
hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java:
##########
@@ -145,6 +148,11 @@ public void init() {
     await().atMost(300, SECONDS).until(this::servicesUp);
     LOG.info(String.format("Waiting for all the containers and services finishes in %d ms",
         System.currentTimeMillis() - currTs));
+    try {
+      Thread.sleep(30000);

Review Comment:
   This could be the reason why the Azure CI tests take longer to finish. It would be good to figure out a better way to determine whether the environment is ready for tests.
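
   One alternative, sketched below, is to poll the container ports with the Awaitility helper the class already uses, instead of a fixed sleep; the host and port here are assumptions based on the demo compose setup:

   ```java
   import static java.util.concurrent.TimeUnit.SECONDS;
   import static org.awaitility.Awaitility.await;

   import java.io.IOException;
   import java.net.InetSocketAddress;
   import java.net.Socket;

   final class DockerReadiness {
     // block until a container port accepts connections, rather than
     // sleeping a fixed 30 seconds after the compose cluster reports up
     static void waitForPort(String host, int port) {
       await().atMost(120, SECONDS).pollInterval(2, SECONDS).until(() -> {
         try (Socket socket = new Socket()) {
           socket.connect(new InetSocketAddress(host, port), 1000);
           return true;
         } catch (IOException e) {
           return false;
         }
       });
     }
   }
   // usage (hypothetical): DockerReadiness.waitForPort("hivemetastore", 9083);
   ```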



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/HiveIncrementalPuller.java:
##########
@@ -214,8 +214,6 @@ private void initHiveBeelineProperties(Statement stmt) throws SQLException {
     executeStatement("set mapred.job.queue.name=" + config.yarnQueueName, stmt);
     // Set the inputFormat to HoodieCombineHiveInputFormat
     executeStatement("set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat", stmt);
-    // Allow queries without partition predicate
-    executeStatement("set hive.strict.checks.large.query=false", stmt);

Review Comment:
   Do we still need to keep this?
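
   If the flag is only needed on some Hive versions, one option is to guard it rather than delete it outright. A rough sketch against the method quoted above; the version check is hypothetical:

   ```java
   // hypothetical guard: keep allowing queries without a partition predicate
   // only on Hive versions where the flag is still honored
   if (hiveSupportsStrictLargeQueryCheck) {
     executeStatement("set hive.strict.checks.large.query=false", stmt);
   }
   ```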



##########
packaging/hudi-utilities-bundle/pom.xml:
##########
@@ -409,6 +409,12 @@
       <artifactId>hive-service</artifactId>
       <version>${hive.version}</version>
       <scope>${utilities.bundle.hive.scope}</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.eclipse.jetty</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>

Review Comment:
   Yeah, @rahil-c, let's try to minimize the exclusions.



##########
packaging/hudi-hadoop-mr-bundle/pom.xml:
##########
@@ -29,6 +29,8 @@
   <properties>
     <checkstyle.skip>true</checkstyle.skip>
     <main.basedir>${project.parent.basedir}</main.basedir>
+    <!-- override parquet version to be same as Hive 3.1.2 -->
+    <parquet.version>1.10.1</parquet.version>

Review Comment:
   @rahil-c adding to Udit's point, do we need to build the hudi-hadoop-mr-bundle for Hive 2.x and 3.x differently, based on the dependency version used?



##########
docker/setup_demo.sh:
##########
@@ -16,17 +16,21 @@
 #  See the License for the specific language governing permissions and
 # limitations under the License.
 
+set -e -x -o pipefail
+
 SCRIPT_PATH=$(cd `dirname $0`; pwd)
 HUDI_DEMO_ENV=$1
 WS_ROOT=`dirname $SCRIPT_PATH`
 # restart cluster
-HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml down
+#HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml down

Review Comment:
   Sounds good. Let's add a comment in the script as well.



##########
hudi-spark-datasource/hudi-spark3/pom.xml:
##########
@@ -262,7 +261,7 @@
 
     <dependency>
       <groupId>org.apache.hudi</groupId>
-      <artifactId>hudi-spark3-common</artifactId>
+      <artifactId>${hudi.spark.common.module}</artifactId>

Review Comment:
   +1



##########
hudi-utilities/pom.xml:
##########
@@ -233,6 +242,17 @@
       </exclusions>
     </dependency>
 
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-hive_${scala.binary.version}</artifactId>
+      <exclusions>
+        <exclusion>
+          <groupId>*</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>

Review Comment:
   I have the same question.  And it looks like all artifacts are excluded?



##########
hudi-timeline-service/pom.xml:
##########
@@ -117,6 +123,28 @@
       <artifactId>rocksdbjni</artifactId>
     </dependency>
 
+    <!-- Jetty -->
+    <dependency>
+      <groupId>org.eclipse.jetty</groupId>
+      <artifactId>jetty-server</artifactId>
+      <version>${jetty.version}</version>
+    </dependency>

Review Comment:
   +1



##########
hudi-client/hudi-spark-client/pom.xml:
##########
@@ -174,6 +208,12 @@
       <artifactId>awaitility</artifactId>
       <scope>test</scope>
     </dependency>
+    <dependency>

Review Comment:
   What is this used for?  Does it have a compatible license?



##########
hudi-spark-datasource/hudi-spark3/pom.xml:
##########
@@ -24,7 +24,7 @@
   <artifactId>hudi-spark3_2.12</artifactId>
   <version>0.12.0-SNAPSHOT</version>
 
-  <name>hudi-spark3_2.12</name>
+  <name>hudi-spark3_${spark3.scala.binary.version}</name>

Review Comment:
   Same here


