You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/19 14:10:00 UTC

[GitHub] [iceberg] marton-bod opened a new pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

marton-bod opened a new pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631


   Had a discussion earlier with @massdosage and @pvary on this topic, and I think it makes sense to introduce a new `TestHiveShell` test class, which basically does the same things as our current HiveRunner/HiveShell. 
   
   The biggest difference is that the new test class can be instantiated once per test suite and does not require its own runner. This leads to a few important advantages over HiveRunner:
   
   - Quicker: due to not having to restart HS2 after each test method, I've noticed a ~20% speed up when using the `TestHiveShell`. This is expected improve further as we're adding more and more test cases to the StorageHandler tests
   - Flakyness: not having to recreate the HS2 every time also means fewer chances of something going wrong with the metastore connections (i.e. Broken Pipe intermittent flaky problems)
   - Ability to use other Runners: not having to use the HiveStandaloneRunner, we're freed up to use the `Parameterized` runner, so now we can run all tests for all file formats without boilerplate. It reduces code duplication and it's just much much cleaner.
   
   The `TestHiveShell` would use pretty much the same interface as HiveShell, therefore not much of the client code in the test would need to change, making it a smooth transition.
   cc @rdblue 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#issuecomment-712956694


   Thanks for your review @rdblue! I've made some changes based on your suggestions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r511848026



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {

Review comment:
       The metastore client is cached in threadlocal, so each test method will create one and reuse it across its query executions (so not 1 connection per test suite but 1 per test method). However, since the connections are properly closed by calling `session.close()` at the end of each test method, I see no reason to keep the higher pool size. Changed it back to the default size.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508009845



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {
+
+  private static final DynMethods.StaticMethod FIND_FREE_PORT = DynMethods.builder("findFreePort")
+          .impl("org.apache.hadoop.hive.metastore.utils.MetaStoreUtils")
+          .impl("org.apache.hadoop.hive.metastore.MetaStoreUtils")
+          .buildStatic();
+
+  private final HiveServer2 hs2;
+  private final HiveConf conf;
+  private CLIService client;
+  private HiveSession session;
+  private boolean started;
+
+  public TestHiveShell() {
+    conf = initializeConf();
+    hs2 = new HiveServer2();
+  }
+
+  public void setHiveConfValue(String key, String value) {
+    checkState(!started, "TestHiveShell has already been started. Cannot set Hive conf anymore.");
+    conf.set(key, value);
+  }
+
+  public void setHiveSessionValue(String key, String value) {
+    checkState(session != null, "There is no open session for setting variables.");
+    try {
+      session.getSessionConf().set(key, value);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to set Hive session variable: ", e);
+    }
+  }
+
+  public void start() {
+    checkState(isNotBlank(conf.get(HiveConf.ConfVars.METASTOREURIS.varname)),
+            "hive.metastore.uris must be supplied in config. TestHiveShell needs an external metastore to connect to.");

Review comment:
       If `TestMetaStore` were integrated into this class, then I don't think we would need this check.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r511116587



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.hive.TestHiveMetastore;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It spins up both an HS2 and a Metastore instance to work with. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {
+
+  // We need to use increased pool size in these tests. See: #1620
+  private static final int METASTORE_POOL_SIZE = 15;
+
+  private final TestHiveMetastore metastore;
+  private final HiveServer2 hs2;
+  private final HiveConf hs2Conf;
+  private CLIService client;
+  private HiveSession session;
+  private boolean started;
+
+  public TestHiveShell() {
+    metastore = new TestHiveMetastore();
+    hs2Conf = initializeConf();
+    hs2 = new HiveServer2();
+  }
+
+  public void setHiveConfValue(String key, String value) {
+    Preconditions.checkState(!started, "TestHiveShell has already been started. Cannot set Hive conf anymore.");
+    hs2Conf.verifyAndSet(key, value);
+  }
+
+  public void setHiveSessionValue(String key, String value) {
+    Preconditions.checkState(session != null, "There is no open session for setting variables.");
+    try {
+      session.getSessionConf().set(key, value);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to set Hive session variable: ", e);
+    }
+  }
+
+  public void start() {
+    metastore.start(METASTORE_POOL_SIZE);
+    hs2Conf.setVar(HiveConf.ConfVars.METASTOREURIS, metastore.hiveConf().getVar(HiveConf.ConfVars.METASTOREURIS));
+    hs2Conf.setVar(HiveConf.ConfVars.METASTOREWAREHOUSE,
+        metastore.hiveConf().getVar(HiveConf.ConfVars.METASTOREWAREHOUSE));
+
+    hs2.init(hs2Conf);
+    hs2.start();
+    client = hs2.getServices().stream()
+            .filter(CLIService.class::isInstance)
+            .findFirst()
+            .map(CLIService.class::cast)
+            .get();
+    started = true;
+  }
+
+  public void stop() {
+    if (client != null) {
+      client.stop();
+    }
+    hs2.stop();
+    metastore.stop();
+    started = false;
+  }
+
+  public TestHiveMetastore getMetastore() {

Review comment:
       Nit: we avoid using `get` because it doesn't add much value. Getter methods are usually just named for the field that is returned.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r511119543



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {

Review comment:
       From your answers to the first two, it sounds like there will be one connection per driver that is reused, and planning/execution is single-threaded so we shouldn't have multiple connections from a single test case.
   
   And if the driver or its underlying connection to HMS is reused across sessions, then this should open a single connection to the Hive MetaStore per test suite. Is that correct?
   
   If so, then I think we should be able to turn down the metastore pool size from 15 to 5 because this fixes the problem where a new connection is used for each command.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r507914731



##########
File path: build.gradle
##########
@@ -463,12 +463,10 @@ project(':iceberg-mr') {
     testCompile("org.apache.calcite:calcite-core")
     testCompile("com.esotericsoftware:kryo-shaded:4.0.2")
     testCompile("com.fasterxml.jackson.core:jackson-annotations:2.6.5")
-    testCompile("com.klarna:hiverunner:5.2.1") {
-      exclude group: 'javax.jms', module: 'jms'
+    testCompile("org.apache.hive:hive-service") {
       exclude group: 'org.apache.hive', module: 'hive-exec'
-      exclude group: 'org.codehaus.jettison', module: 'jettison'
-      exclude group: 'org.apache.calcite.avatica'
     }
+    testCompile("org.apache.tez:tez-dag:0.8.4")

Review comment:
       You're right, thanks. It's not required for Hive2, indeed. I've removed the dependency.
   On the other hand, it is needed for Hive3, where `HiveServer2.start()` initializes a `TezSessionPoolManager`, which depends on some Tez code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508007808



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {
+
+  private static final DynMethods.StaticMethod FIND_FREE_PORT = DynMethods.builder("findFreePort")
+          .impl("org.apache.hadoop.hive.metastore.utils.MetaStoreUtils")
+          .impl("org.apache.hadoop.hive.metastore.MetaStoreUtils")
+          .buildStatic();
+
+  private final HiveServer2 hs2;
+  private final HiveConf conf;
+  private CLIService client;
+  private HiveSession session;
+  private boolean started;
+
+  public TestHiveShell() {
+    conf = initializeConf();
+    hs2 = new HiveServer2();
+  }
+
+  public void setHiveConfValue(String key, String value) {
+    checkState(!started, "TestHiveShell has already been started. Cannot set Hive conf anymore.");
+    conf.set(key, value);
+  }
+
+  public void setHiveSessionValue(String key, String value) {
+    checkState(session != null, "There is no open session for setting variables.");
+    try {
+      session.getSessionConf().set(key, value);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to set Hive session variable: ", e);
+    }
+  }
+
+  public void start() {
+    checkState(isNotBlank(conf.get(HiveConf.ConfVars.METASTOREURIS.varname)),
+            "hive.metastore.uris must be supplied in config. TestHiveShell needs an external metastore to connect to.");
+    hs2.init(conf);
+    hs2.start();
+    client = hs2.getServices().stream()
+            .filter(CLIService.class::isInstance)
+            .findFirst()
+            .map(CLIService.class::cast)
+            .get();
+    started = true;
+  }
+
+  public void stop() {
+    if (client != null) {
+      client.stop();
+    }
+    hs2.stop();
+    started = false;
+  }
+
+  public void openSession() {
+    checkState(started, "You have to start TestHiveShell first, before opening a session.");
+    try {
+      SessionHandle sessionHandle = client.getSessionManager().openSession(
+              CLIService.SERVER_VERSION, "", "", "127.0.0.1", Collections.emptyMap());

Review comment:
       Nit: continuing indentation is off in this file




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#issuecomment-712394812


   Thanks, @marton-bod! I like the direction that this is heading. I just have a few questions to understand how this alternative works.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] shardulm94 commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
shardulm94 commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r507894893



##########
File path: build.gradle
##########
@@ -463,12 +463,10 @@ project(':iceberg-mr') {
     testCompile("org.apache.calcite:calcite-core")
     testCompile("com.esotericsoftware:kryo-shaded:4.0.2")
     testCompile("com.fasterxml.jackson.core:jackson-annotations:2.6.5")
-    testCompile("com.klarna:hiverunner:5.2.1") {
-      exclude group: 'javax.jms', module: 'jms'
+    testCompile("org.apache.hive:hive-service") {
       exclude group: 'org.apache.hive', module: 'hive-exec'
-      exclude group: 'org.codehaus.jettison', module: 'jettison'
-      exclude group: 'org.apache.calcite.avatica'
     }
+    testCompile("org.apache.tez:tez-dag:0.8.4")

Review comment:
       Is the tez dependency required? I don't see any imports explicitly referencing tez.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508443343



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandlerBaseTest.java
##########
@@ -112,113 +109,73 @@
 
   // before variables
   protected static TestHiveMetastore metastore;
+  private static TestHiveShell shell;
 
   private TestTables testTables;
 
   public abstract TestTables testTables(Configuration conf, TemporaryFolder tmp) throws IOException;
 
+  @Parameters(name = "fileFormat={0}")
+  public static Iterable<FileFormat> fileFormats() {
+    return ImmutableList.of(FileFormat.PARQUET, FileFormat.ORC, FileFormat.AVRO);
+  }
+
+  @Parameter
+  public FileFormat fileFormat;
 
   @BeforeClass
   public static void beforeClass() {
     metastore = new TestHiveMetastore();
     // We need to use increased pool size in these tests. See: #1620
     metastore.start(METASTORE_POOL_SIZE);
+    shell = new TestHiveShell();
+
+    String metastoreUris = metastore.hiveConf().getVar(HiveConf.ConfVars.METASTOREURIS);
+    shell.setHiveConfValue(HiveConf.ConfVars.METASTOREURIS.varname, metastoreUris);
+    String metastoreWarehouse = metastore.hiveConf().getVar(HiveConf.ConfVars.METASTOREWAREHOUSE);
+    shell.setHiveConfValue(HiveConf.ConfVars.METASTOREWAREHOUSE.varname, metastoreWarehouse);
+    shell.setHiveConfValue("hive.notification.event.poll.interval", "-1");
+
+    shell.start();

Review comment:
       That's right, `TestHiveShell` always needs an external metastore instance, so yes I agree it makes sense to embed the `TestHiveMetastore`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508646258



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {

Review comment:
       > Do you know how the metastore connections are handled in HS2? 
   
   During query compilation, HS2 calls the Metastore via thrift (using the `ThriftHiveMetastore$Client`), synchronously in the same thread where the Driver performs the other compilation tasks. The created metastore client is cached and reused throughout the life of the Driver execution.
   
   > Does each HS2 service have a connection pool?
   
   The HS2 service we're using for submitting queries (`CLIClient`) does not use a thread pool (it only uses a thread pool when async compilation/execution is enabled, which isn't in our tests or by default).
   
   > Why did you choose to use a session per test case? Is that to avoid some sort of state leak? 
   
   Exactly, to avoid state/configuration leaks, in order to provide a 'clean slate' for each test method. Test methods could therefore go and set their session-level variables individually that they need for their test scenarios, without risks of test config spilling over (which could be hard to debug).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508438534



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {
+
+  private static final DynMethods.StaticMethod FIND_FREE_PORT = DynMethods.builder("findFreePort")
+          .impl("org.apache.hadoop.hive.metastore.utils.MetaStoreUtils")
+          .impl("org.apache.hadoop.hive.metastore.MetaStoreUtils")
+          .buildStatic();
+
+  private final HiveServer2 hs2;
+  private final HiveConf conf;
+  private CLIService client;
+  private HiveSession session;
+  private boolean started;
+
+  public TestHiveShell() {
+    conf = initializeConf();
+    hs2 = new HiveServer2();
+  }
+
+  public void setHiveConfValue(String key, String value) {
+    checkState(!started, "TestHiveShell has already been started. Cannot set Hive conf anymore.");
+    conf.set(key, value);
+  }
+
+  public void setHiveSessionValue(String key, String value) {
+    checkState(session != null, "There is no open session for setting variables.");
+    try {
+      session.getSessionConf().set(key, value);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to set Hive session variable: ", e);
+    }
+  }
+
+  public void start() {
+    checkState(isNotBlank(conf.get(HiveConf.ConfVars.METASTOREURIS.varname)),
+            "hive.metastore.uris must be supplied in config. TestHiveShell needs an external metastore to connect to.");
+    hs2.init(conf);
+    hs2.start();
+    client = hs2.getServices().stream()
+            .filter(CLIService.class::isInstance)
+            .findFirst()
+            .map(CLIService.class::cast)
+            .get();
+    started = true;
+  }
+
+  public void stop() {
+    if (client != null) {
+      client.stop();
+    }
+    hs2.stop();
+    started = false;
+  }
+
+  public void openSession() {
+    checkState(started, "You have to start TestHiveShell first, before opening a session.");
+    try {
+      SessionHandle sessionHandle = client.getSessionManager().openSession(
+              CLIService.SERVER_VERSION, "", "", "127.0.0.1", Collections.emptyMap());
+      session = client.getSessionManager().getSession(sessionHandle);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to open new Hive session: ", e);
+    }
+  }
+
+  public void closeSession() {
+    checkState(session != null, "There is no open session to be closed.");
+    try {
+      session.close();
+      session = null;
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to close Hive session: ", e);
+    }
+  }
+
+  public List<Object[]> executeStatement(String statement) {
+    checkState(session != null,
+            "You have to start TestHiveShell and open a session first, before running a query.");
+    try {
+      OperationHandle handle = client.executeStatement(session.getSessionHandle(), statement, Collections.emptyMap());
+      List<Object[]> resultSet = new ArrayList<>();
+      if (handle.hasResultSet()) {
+        RowSet rowSet;
+        // keep fetching results until we can
+        while ((rowSet = client.fetchResults(handle)) != null && rowSet.numRows() > 0) {
+          for (Object[] row : rowSet) {
+            resultSet.add(row.clone());
+          }
+        }
+      }
+      return resultSet;
+    } catch (HiveSQLException e) {
+      throw new IllegalArgumentException("Failed to execute Hive query '" + statement + "': " + e.getMessage(), e);
+    }
+  }
+
+  public Configuration getHiveConf() {
+    if (session != null) {
+      return session.getHiveConf();
+    } else {
+      return conf;
+    }
+  }
+
+  private HiveConf initializeConf() {
+    HiveConf hiveConf = new HiveConf();
+
+    // Use random port to enable running tests in parallel
+    hiveConf.setIntVar(HiveConf.ConfVars.HIVE_SERVER2_THRIFT_PORT, FIND_FREE_PORT.invoke());

Review comment:
       You're right, there is a chance of collision. I've changed the port number to 0 in the config, which will make HS2 use an ephemeral port.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508009159



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandlerBaseTest.java
##########
@@ -112,113 +109,73 @@
 
   // before variables
   protected static TestHiveMetastore metastore;
+  private static TestHiveShell shell;
 
   private TestTables testTables;
 
   public abstract TestTables testTables(Configuration conf, TemporaryFolder tmp) throws IOException;
 
+  @Parameters(name = "fileFormat={0}")
+  public static Iterable<FileFormat> fileFormats() {
+    return ImmutableList.of(FileFormat.PARQUET, FileFormat.ORC, FileFormat.AVRO);
+  }
+
+  @Parameter
+  public FileFormat fileFormat;
 
   @BeforeClass
   public static void beforeClass() {
     metastore = new TestHiveMetastore();
     // We need to use increased pool size in these tests. See: #1620
     metastore.start(METASTORE_POOL_SIZE);
+    shell = new TestHiveShell();
+
+    String metastoreUris = metastore.hiveConf().getVar(HiveConf.ConfVars.METASTOREURIS);
+    shell.setHiveConfValue(HiveConf.ConfVars.METASTOREURIS.varname, metastoreUris);
+    String metastoreWarehouse = metastore.hiveConf().getVar(HiveConf.ConfVars.METASTOREWAREHOUSE);
+    shell.setHiveConfValue(HiveConf.ConfVars.METASTOREWAREHOUSE.varname, metastoreWarehouse);
+    shell.setHiveConfValue("hive.notification.event.poll.interval", "-1");
+
+    shell.start();

Review comment:
       Wouldn't it make sense to embed the `TestHiveMetastore` in `TestHiveShell` so that it is always available? Or is there a case where we use `TestHiveShell` without `TestHiveMetastore`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508438703



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {
+
+  private static final DynMethods.StaticMethod FIND_FREE_PORT = DynMethods.builder("findFreePort")
+          .impl("org.apache.hadoop.hive.metastore.utils.MetaStoreUtils")
+          .impl("org.apache.hadoop.hive.metastore.MetaStoreUtils")
+          .buildStatic();
+
+  private final HiveServer2 hs2;
+  private final HiveConf conf;
+  private CLIService client;
+  private HiveSession session;
+  private boolean started;
+
+  public TestHiveShell() {
+    conf = initializeConf();
+    hs2 = new HiveServer2();
+  }
+
+  public void setHiveConfValue(String key, String value) {
+    checkState(!started, "TestHiveShell has already been started. Cannot set Hive conf anymore.");
+    conf.set(key, value);
+  }
+
+  public void setHiveSessionValue(String key, String value) {
+    checkState(session != null, "There is no open session for setting variables.");
+    try {
+      session.getSessionConf().set(key, value);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to set Hive session variable: ", e);
+    }
+  }
+
+  public void start() {
+    checkState(isNotBlank(conf.get(HiveConf.ConfVars.METASTOREURIS.varname)),
+            "hive.metastore.uris must be supplied in config. TestHiveShell needs an external metastore to connect to.");
+    hs2.init(conf);
+    hs2.start();
+    client = hs2.getServices().stream()
+            .filter(CLIService.class::isInstance)
+            .findFirst()
+            .map(CLIService.class::cast)
+            .get();
+    started = true;
+  }
+
+  public void stop() {
+    if (client != null) {
+      client.stop();
+    }
+    hs2.stop();
+    started = false;
+  }
+
+  public void openSession() {
+    checkState(started, "You have to start TestHiveShell first, before opening a session.");
+    try {
+      SessionHandle sessionHandle = client.getSessionManager().openSession(
+              CLIService.SERVER_VERSION, "", "", "127.0.0.1", Collections.emptyMap());

Review comment:
       fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508439644



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;

Review comment:
       sure, removed the static imports 
   (the `isEmpty` is not needed anymore due to "inhousing" the metastore)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] marton-bod commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
marton-bod commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r511845695



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.hive.TestHiveMetastore;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It spins up both an HS2 and a Metastore instance to work with. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {
+
+  // We need to use increased pool size in these tests. See: #1620
+  private static final int METASTORE_POOL_SIZE = 15;
+
+  private final TestHiveMetastore metastore;
+  private final HiveServer2 hs2;
+  private final HiveConf hs2Conf;
+  private CLIService client;
+  private HiveSession session;
+  private boolean started;
+
+  public TestHiveShell() {
+    metastore = new TestHiveMetastore();
+    hs2Conf = initializeConf();
+    hs2 = new HiveServer2();
+  }
+
+  public void setHiveConfValue(String key, String value) {
+    Preconditions.checkState(!started, "TestHiveShell has already been started. Cannot set Hive conf anymore.");
+    hs2Conf.verifyAndSet(key, value);
+  }
+
+  public void setHiveSessionValue(String key, String value) {
+    Preconditions.checkState(session != null, "There is no open session for setting variables.");
+    try {
+      session.getSessionConf().set(key, value);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to set Hive session variable: ", e);
+    }
+  }
+
+  public void start() {
+    metastore.start(METASTORE_POOL_SIZE);
+    hs2Conf.setVar(HiveConf.ConfVars.METASTOREURIS, metastore.hiveConf().getVar(HiveConf.ConfVars.METASTOREURIS));
+    hs2Conf.setVar(HiveConf.ConfVars.METASTOREWAREHOUSE,
+        metastore.hiveConf().getVar(HiveConf.ConfVars.METASTOREWAREHOUSE));
+
+    hs2.init(hs2Conf);
+    hs2.start();
+    client = hs2.getServices().stream()
+            .filter(CLIService.class::isInstance)
+            .findFirst()
+            .map(CLIService.class::cast)
+            .get();
+    started = true;
+  }
+
+  public void stop() {
+    if (client != null) {
+      client.stop();
+    }
+    hs2.stop();
+    metastore.stop();
+    started = false;
+  }
+
+  public TestHiveMetastore getMetastore() {

Review comment:
       sure, renamed it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508005755



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {
+
+  private static final DynMethods.StaticMethod FIND_FREE_PORT = DynMethods.builder("findFreePort")
+          .impl("org.apache.hadoop.hive.metastore.utils.MetaStoreUtils")
+          .impl("org.apache.hadoop.hive.metastore.MetaStoreUtils")
+          .buildStatic();
+
+  private final HiveServer2 hs2;
+  private final HiveConf conf;
+  private CLIService client;
+  private HiveSession session;
+  private boolean started;
+
+  public TestHiveShell() {
+    conf = initializeConf();
+    hs2 = new HiveServer2();
+  }
+
+  public void setHiveConfValue(String key, String value) {
+    checkState(!started, "TestHiveShell has already been started. Cannot set Hive conf anymore.");
+    conf.set(key, value);
+  }
+
+  public void setHiveSessionValue(String key, String value) {
+    checkState(session != null, "There is no open session for setting variables.");
+    try {
+      session.getSessionConf().set(key, value);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to set Hive session variable: ", e);
+    }
+  }
+
+  public void start() {
+    checkState(isNotBlank(conf.get(HiveConf.ConfVars.METASTOREURIS.varname)),
+            "hive.metastore.uris must be supplied in config. TestHiveShell needs an external metastore to connect to.");
+    hs2.init(conf);
+    hs2.start();
+    client = hs2.getServices().stream()
+            .filter(CLIService.class::isInstance)
+            .findFirst()
+            .map(CLIService.class::cast)
+            .get();
+    started = true;
+  }
+
+  public void stop() {
+    if (client != null) {
+      client.stop();
+    }
+    hs2.stop();
+    started = false;
+  }
+
+  public void openSession() {
+    checkState(started, "You have to start TestHiveShell first, before opening a session.");
+    try {
+      SessionHandle sessionHandle = client.getSessionManager().openSession(
+              CLIService.SERVER_VERSION, "", "", "127.0.0.1", Collections.emptyMap());
+      session = client.getSessionManager().getSession(sessionHandle);
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to open new Hive session: ", e);
+    }
+  }
+
+  public void closeSession() {
+    checkState(session != null, "There is no open session to be closed.");
+    try {
+      session.close();
+      session = null;
+    } catch (Exception e) {
+      throw new RuntimeException("Unable to close Hive session: ", e);
+    }
+  }
+
+  public List<Object[]> executeStatement(String statement) {
+    checkState(session != null,
+            "You have to start TestHiveShell and open a session first, before running a query.");
+    try {
+      OperationHandle handle = client.executeStatement(session.getSessionHandle(), statement, Collections.emptyMap());
+      List<Object[]> resultSet = new ArrayList<>();
+      if (handle.hasResultSet()) {
+        RowSet rowSet;
+        // keep fetching results until we can
+        while ((rowSet = client.fetchResults(handle)) != null && rowSet.numRows() > 0) {
+          for (Object[] row : rowSet) {
+            resultSet.add(row.clone());
+          }
+        }
+      }
+      return resultSet;
+    } catch (HiveSQLException e) {
+      throw new IllegalArgumentException("Failed to execute Hive query '" + statement + "': " + e.getMessage(), e);
+    }
+  }
+
+  public Configuration getHiveConf() {
+    if (session != null) {
+      return session.getHiveConf();
+    } else {
+      return conf;
+    }
+  }
+
+  private HiveConf initializeConf() {
+    HiveConf hiveConf = new HiveConf();
+
+    // Use random port to enable running tests in parallel
+    hiveConf.setIntVar(HiveConf.ConfVars.HIVE_SERVER2_THRIFT_PORT, FIND_FREE_PORT.invoke());

Review comment:
       Isn't there an option to get an ephemeral port when creating the socket in the listener? Seems odd to find a free port and try to use it. There's a small chance of collision with new connections on the host.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#issuecomment-717551611


   Looks good. Thanks @marton-bod!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508010813



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;
+
+/**
+ * Test class for running HiveQL queries, essentially acting like a Beeline shell in tests.
+ *
+ * It takes a metastore URL via conf, and spins up an HS2 instance which connects to it. The shell will only accept
+ * queries if it has been previously initialized via {@link #start()}, and a session has been opened via
+ * {@link #openSession()}. Prior to calling {@link #start()}, the shell should first be configured with props that apply
+ * across all test cases by calling {@link #setHiveConfValue(String, String)} ()}. On the other hand, session-level conf
+ * can be applied anytime via {@link #setHiveSessionValue(String, String)} ()}, once we've opened an active session.
+ */
+public class TestHiveShell {

Review comment:
       Do you know how the metastore connections are handled in HS2? Does each HS2 service have a connection pool? Or are connections created per session?
   
   Why did you choose to use a session per test case? Is that to avoid some sort of state leak? Are sessions not supposed to be used concurrently?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1631: Use TestHiveShell to run StorageHandler tests. Use Parameterized runner.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1631:
URL: https://github.com/apache/iceberg/pull/1631#discussion_r508008422



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveShell.java
##########
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hive.service.cli.CLIService;
+import org.apache.hive.service.cli.HiveSQLException;
+import org.apache.hive.service.cli.OperationHandle;
+import org.apache.hive.service.cli.RowSet;
+import org.apache.hive.service.cli.SessionHandle;
+import org.apache.hive.service.cli.session.HiveSession;
+import org.apache.hive.service.server.HiveServer2;
+import org.apache.iceberg.common.DynMethods;
+
+import static org.apache.commons.lang3.StringUtils.isNotBlank;
+import static org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkState;

Review comment:
       We try to avoid using statically imported methods. It's nice to see what class methods are in.
   
   Also, we don't want to introduce a directly dependency on Apache commons. Could you use `isEmpty` instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org