You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@kyuubi.apache.org by "cfmcgrady (via GitHub)" <gi...@apache.org> on 2023/02/21 09:10:42 UTC

[GitHub] [kyuubi] cfmcgrady opened a new pull request, #4392: [ARROW] Assign a new execution id for arrow-based result

cfmcgrady opened a new pull request, #4392:
URL: https://github.com/apache/kyuubi/pull/4392

   <!--
   Thanks for sending a pull request!
   
   Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/CONTRIBUTING.html
     2. If the PR is related to an issue in https://github.com/apache/kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
   -->
   
   ### _Why are the changes needed?_
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you add a feature, you can talk about the use case of it.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   TODO
   
   
   ### _How was this patch tested?_
   - [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
   
   - [ ] Add screenshots for manual tests if appropriate
   
   - [ ] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before make a pull request
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] cfmcgrady commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "cfmcgrady (via GitHub)" <gi...@apache.org>.
cfmcgrady commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1112776869


##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/schema/RowSet.scala:
##########
@@ -61,16 +60,15 @@ object RowSet {
   def toTRowSet(
       rows: Seq[Row],
       schema: StructType,
-      protocolVersion: TProtocolVersion,
-      timeZone: ZoneId): TRowSet = {
+      protocolVersion: TProtocolVersion): TRowSet = {
     if (protocolVersion.getValue < TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V6.getValue) {
-      toRowBasedSet(rows, schema, timeZone)

Review Comment:
   `timeZone`is not required after https://github.com/apache/kyuubi/pull/4318 cc @pan3793 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] cfmcgrady commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "cfmcgrady (via GitHub)" <gi...@apache.org>.
cfmcgrady commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1113757779


##########
kyuubi-common/src/test/scala/org/apache/kyuubi/operation/SparkQueryTests.scala:
##########
@@ -433,13 +433,13 @@ trait SparkQueryTests extends SparkDataTypeTests with HiveJDBCTestHelper {
         expectedFormat = "thrift")
       checkStatusAndResultSetFormatHint(
         sql = "set kyuubi.operation.result.format=arrow",
-        expectedFormat = "arrow")
+        expectedFormat = "thrift")

Review Comment:
   This PR changes the behavior, after this PR, the format operation will take effect after the current operation `set kyuubi.operation.result.format=thrift/arrow;`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] pan3793 commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1113759792


##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/ExecuteStatement.scala:
##########
@@ -62,49 +63,49 @@ class ExecuteStatement(
     OperationLog.removeCurrentOperationLog()
   }
 
-  private def executeStatement(): Unit = withLocalProperties {
+  protected def incrementalCollectResult(): Iterator[Any] = {
+    result.toLocalIterator().asScala

Review Comment:
   why not pass "result" as a parameter



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] cfmcgrady commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "cfmcgrady (via GitHub)" <gi...@apache.org>.
cfmcgrady commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1113769560


##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/ExecuteStatement.scala:
##########
@@ -62,49 +63,49 @@ class ExecuteStatement(
     OperationLog.removeCurrentOperationLog()
   }
 
-  private def executeStatement(): Unit = withLocalProperties {
+  protected def incrementalCollectResult(): Iterator[Any] = {
+    result.toLocalIterator().asScala

Review Comment:
   updated



##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/ExecuteStatement.scala:
##########
@@ -62,49 +63,49 @@ class ExecuteStatement(
     OperationLog.removeCurrentOperationLog()
   }
 
-  private def executeStatement(): Unit = withLocalProperties {
+  protected def incrementalCollectResult(): Iterator[Any] = {
+    result.toLocalIterator().asScala
+  }
+
+  protected def fullCollectResult(): Array[_] = {
+    result.collect()
+  }
+
+  protected def takeResult(maxRows: Int): Array[_] = {
+    result.take(maxRows)
+  }
+
+  protected def collectResult(): Unit = {

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] cfmcgrady commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "cfmcgrady (via GitHub)" <gi...@apache.org>.
cfmcgrady commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1113757779


##########
kyuubi-common/src/test/scala/org/apache/kyuubi/operation/SparkQueryTests.scala:
##########
@@ -433,13 +433,13 @@ trait SparkQueryTests extends SparkDataTypeTests with HiveJDBCTestHelper {
         expectedFormat = "thrift")
       checkStatusAndResultSetFormatHint(
         sql = "set kyuubi.operation.result.format=arrow",
-        expectedFormat = "arrow")
+        expectedFormat = "thrift")

Review Comment:
   This PR changes the behavior, after this PR, the set format operation will take effect after the current operation `set kyuubi.operation.result.format=thrift/arrow;`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] turboFei commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "turboFei (via GitHub)" <gi...@apache.org>.
turboFei commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1125443844


##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/SparkOperation.scala:
##########
@@ -257,8 +257,7 @@ abstract class SparkOperation(session: Session)
             RowSet.toTRowSet(
               taken.toSeq.asInstanceOf[Seq[Row]],
               resultSchema,
-              getProtocolVersion,
-              timeZone)

Review Comment:
   why remove this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] pan3793 commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1114116112


##########
externals/kyuubi-spark-sql-engine/src/test/scala/org/apache/kyuubi/engine/spark/operation/SparkArrowbasedOperationSuite.scala:
##########
@@ -85,6 +90,27 @@ class SparkArrowbasedOperationSuite extends WithSparkSQLEngine with SparkDataTyp
     }
   }
 
+  ignore("assign a new execution id for arrow-based result") {

Review Comment:
   why disable this test?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] pan3793 commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1113759276


##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/ExecuteStatement.scala:
##########
@@ -171,3 +172,39 @@ class ExecuteStatement(
       s"__kyuubi_operation_result_format__=$resultFormat",
       s"__kyuubi_operation_result_arrow_timestampAsString__=$timestampAsString")
 }
+
+class ArrowBasedExecuteStatement(
+    session: Session,
+    override val statement: String,
+    override val shouldRunAsync: Boolean,
+    queryTimeout: Long,
+    incrementalCollect: Boolean)
+  extends ExecuteStatement(session, statement, shouldRunAsync, queryTimeout, incrementalCollect) {
+
+  override protected def incrementalCollectResult(): Iterator[Any] = {
+    SparkDatasetHelper.toArrowBatchRdd(convertComplexType(result)).toLocalIterator
+  }
+
+  override protected def fullCollectResult(): Array[_] = {
+    SparkDatasetHelper.toArrowBatchRdd(convertComplexType(result)).collect()
+  }
+
+  override protected def takeResult(maxRows: Int): Array[_] = {
+    // this will introduce shuffle and hurt performance
+    val limitedResult = result.limit(maxRows)
+    SparkDatasetHelper.toArrowBatchRdd(convertComplexType(limitedResult)).collect()
+  }
+
+  /**
+   * assign a new execution id for arrow-based operation.
+   */
+  override protected def collectResult(): Unit = {
+    SQLExecution.withNewExecutionId(result.queryExecution, Some("toArrow")) {

Review Comment:
   maybe "collectAsArrow"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] pan3793 commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1113760064


##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/ExecuteStatement.scala:
##########
@@ -62,49 +63,49 @@ class ExecuteStatement(
     OperationLog.removeCurrentOperationLog()
   }
 
-  private def executeStatement(): Unit = withLocalProperties {
+  protected def incrementalCollectResult(): Iterator[Any] = {
+    result.toLocalIterator().asScala
+  }
+
+  protected def fullCollectResult(): Array[_] = {
+    result.collect()
+  }
+
+  protected def takeResult(maxRows: Int): Array[_] = {
+    result.take(maxRows)
+  }
+
+  protected def collectResult(): Unit = {

Review Comment:
   why not pass result as parameter and return iter?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] cfmcgrady commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "cfmcgrady (via GitHub)" <gi...@apache.org>.
cfmcgrady commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1114143267


##########
externals/kyuubi-spark-sql-engine/src/test/scala/org/apache/kyuubi/engine/spark/operation/SparkArrowbasedOperationSuite.scala:
##########
@@ -85,6 +90,27 @@ class SparkArrowbasedOperationSuite extends WithSparkSQLEngine with SparkDataTyp
     }
   }
 
+  ignore("assign a new execution id for arrow-based result") {

Review Comment:
   enabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] pan3793 closed pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 closed pull request #4392: [ARROW] Assign a new execution id for arrow-based result
URL: https://github.com/apache/kyuubi/pull/4392


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] pan3793 commented on pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#issuecomment-1440206568

   Thanks, merged to master/1.7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] cfmcgrady commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "cfmcgrady (via GitHub)" <gi...@apache.org>.
cfmcgrady commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1113760018


##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/ExecuteStatement.scala:
##########
@@ -171,3 +172,39 @@ class ExecuteStatement(
       s"__kyuubi_operation_result_format__=$resultFormat",
       s"__kyuubi_operation_result_arrow_timestampAsString__=$timestampAsString")
 }
+
+class ArrowBasedExecuteStatement(
+    session: Session,
+    override val statement: String,
+    override val shouldRunAsync: Boolean,
+    queryTimeout: Long,
+    incrementalCollect: Boolean)
+  extends ExecuteStatement(session, statement, shouldRunAsync, queryTimeout, incrementalCollect) {
+
+  override protected def incrementalCollectResult(): Iterator[Any] = {
+    SparkDatasetHelper.toArrowBatchRdd(convertComplexType(result)).toLocalIterator
+  }
+
+  override protected def fullCollectResult(): Array[_] = {
+    SparkDatasetHelper.toArrowBatchRdd(convertComplexType(result)).collect()
+  }
+
+  override protected def takeResult(maxRows: Int): Array[_] = {
+    // this will introduce shuffle and hurt performance
+    val limitedResult = result.limit(maxRows)
+    SparkDatasetHelper.toArrowBatchRdd(convertComplexType(limitedResult)).collect()
+  }
+
+  /**
+   * assign a new execution id for arrow-based operation.
+   */
+  override protected def collectResult(): Unit = {
+    SQLExecution.withNewExecutionId(result.queryExecution, Some("toArrow")) {

Review Comment:
   addressed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


[GitHub] [kyuubi] pan3793 commented on a diff in pull request #4392: [ARROW] Assign a new execution id for arrow-based result

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #4392:
URL: https://github.com/apache/kyuubi/pull/4392#discussion_r1113252972


##########
externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/schema/RowSet.scala:
##########
@@ -61,16 +60,15 @@ object RowSet {
   def toTRowSet(
       rows: Seq[Row],
       schema: StructType,
-      protocolVersion: TProtocolVersion,
-      timeZone: ZoneId): TRowSet = {
+      protocolVersion: TProtocolVersion): TRowSet = {
     if (protocolVersion.getValue < TProtocolVersion.HIVE_CLI_SERVICE_PROTOCOL_V6.getValue) {
-      toRowBasedSet(rows, schema, timeZone)

Review Comment:
   I see



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org