You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@kyuubi.apache.org by "wForget (via GitHub)" <gi...@apache.org> on 2024/01/29 06:21:44 UTC

[PR] [KYUUBI #6024] Insert crc checksum observer after all project nodes. [kyuubi]

wForget opened a new pull request, #6025:
URL: https://github.com/apache/kyuubi/pull/6025

   # :mag: Description
   ## Issue References ๐Ÿ”—
   <!-- Append the issue number after #. If there is no issue for you to link create one or -->
   <!-- If there are no issues to link, please provide details here. -->
   
   This pull request fixes #6024
   
   ## Describe Your Solution ๐Ÿ”ง
   
   Insert crc checksum observer after all `Project` nodes to compare data at all stage of SQL.
   
   
   ## Types of changes :bookmark:
   <!--- What types of changes does your code introduce? Put an `x` in all the boxes that apply: -->
   - [ ] Bugfix (non-breaking change which fixes an issue)
   - [X] New feature (non-breaking change which adds functionality)
   - [ ] Breaking change (fix or feature that would cause existing functionality to change)
   
   ## Test Plan ๐Ÿงช
   
   #### Behavior Without This Pull Request :coffin:
   
   
   #### Behavior With This Pull Request :tada:
   
   
   #### Related Unit Tests
   
   org.apache.spark.sql.observe.InsertChecksumObserverAfterProjectSuite
   
   ---
   
   # Checklist ๐Ÿ“
   <!--- Go over all the following points, and put an `x` in all the boxes that apply. -->
   <!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
   
   - [X] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
   
   **Be nice. Be informative.**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


Re: [PR] [KYUUBI #6024] Insert crc checksum observer after all project nodes [kyuubi]

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.
ulysses-you commented on code in PR #6025:
URL: https://github.com/apache/kyuubi/pull/6025#discussion_r1469515069


##########
extensions/spark/kyuubi-extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/observe/InsertChecksumObserverAfterProject.scala:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kyuubi.sql.observe
+
+import java.util.concurrent.atomic.AtomicLong
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast, Crc32, Expression, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{Count, Sum}
+import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
+import org.apache.spark.sql.types.{BinaryType, ByteType, DecimalType, IntegerType, LongType, ShortType, StringType}
+
+import org.apache.kyuubi.sql.KyuubiSQLConf.INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED
+import org.apache.kyuubi.sql.observe.InsertChecksumObserverAfterProject._
+
+case class InsertChecksumObserverAfterProject(session: SparkSession) extends Rule[LogicalPlan] {
+
+  private val INSERT_COLLECT_METRICS_TAG = TreeNodeTag[Unit]("__INSERT_COLLECT_METRICS_TAG")
+
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+    if (conf.getConf(INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED)) {
+      plan resolveOperatorsUp {
+        case p: Project if p.resolved && p.getTagValue(INSERT_COLLECT_METRICS_TAG).isEmpty =>

Review Comment:
   why we only add collect metrics for project ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


Re: [PR] [KYUUBI #6024] Insert crc checksum observer after all project nodes [kyuubi]

Posted by "codecov-commenter (via GitHub)" <gi...@apache.org>.
codecov-commenter commented on PR #6025:
URL: https://github.com/apache/kyuubi/pull/6025#issuecomment-1914126647

   ## [Codecov](https://app.codecov.io/gh/apache/kyuubi/pull/6025?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Comparison is base [(`f531a37`)](https://app.codecov.io/gh/apache/kyuubi/commit/f531a372e1ad7d3b7fdb3eea0d2cc4c85e960825?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 61.07% compared to head [(`8506283`)](https://app.codecov.io/gh/apache/kyuubi/pull/6025?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) 61.08%.
   > Report is 2 commits behind head on master.
   
   
   <details><summary>Additional details and impacted files</summary>
   
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #6025      +/-   ##
   ============================================
   + Coverage     61.07%   61.08%   +0.01%     
     Complexity       23       23              
   ============================================
     Files           623      623              
     Lines         37090    37103      +13     
     Branches       5028     5029       +1     
   ============================================
   + Hits          22653    22666      +13     
   - Misses        11986    11992       +6     
   + Partials       2451     2445       -6     
   ```
   
   
   
   </details>
   
   [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/kyuubi/pull/6025?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).   
   :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


Re: [PR] [KYUUBI #6024] Insert crc checksum observer after all project nodes [kyuubi]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #6025:
URL: https://github.com/apache/kyuubi/pull/6025#discussion_r1469520835


##########
extensions/spark/kyuubi-extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/observe/InsertChecksumObserverAfterProject.scala:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kyuubi.sql.observe
+
+import java.util.concurrent.atomic.AtomicLong
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast, Crc32, Expression, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{Count, Sum}
+import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
+import org.apache.spark.sql.types.{BinaryType, ByteType, DecimalType, IntegerType, LongType, ShortType, StringType}
+
+import org.apache.kyuubi.sql.KyuubiSQLConf.INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED
+import org.apache.kyuubi.sql.observe.InsertChecksumObserverAfterProject._
+
+case class InsertChecksumObserverAfterProject(session: SparkSession) extends Rule[LogicalPlan] {
+
+  private val INSERT_COLLECT_METRICS_TAG = TreeNodeTag[Unit]("__INSERT_COLLECT_METRICS_TAG")
+
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+    if (conf.getConf(INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED)) {
+      plan resolveOperatorsUp {
+        case p: Project if p.resolved && p.getTagValue(INSERT_COLLECT_METRICS_TAG).isEmpty =>

Review Comment:
   > why we only add collect metrics for project ?
   
   Most data inconsistencies are caused by project, such as using udf/udaf or type conversion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


Re: [PR] [KYUUBI #6024] Insert crc checksum observer after all project nodes [kyuubi]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #6025:
URL: https://github.com/apache/kyuubi/pull/6025#discussion_r1469537172


##########
extensions/spark/kyuubi-extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/observe/InsertChecksumObserverAfterProject.scala:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kyuubi.sql.observe
+
+import java.util.concurrent.atomic.AtomicLong
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast, Crc32, Expression, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{Count, Sum}
+import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
+import org.apache.spark.sql.types.{BinaryType, ByteType, DecimalType, IntegerType, LongType, ShortType, StringType}
+
+import org.apache.kyuubi.sql.KyuubiSQLConf.INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED
+import org.apache.kyuubi.sql.observe.InsertChecksumObserverAfterProject._
+
+case class InsertChecksumObserverAfterProject(session: SparkSession) extends Rule[LogicalPlan] {
+
+  private val INSERT_COLLECT_METRICS_TAG = TreeNodeTag[Unit]("__INSERT_COLLECT_METRICS_TAG")
+
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+    if (conf.getConf(INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED)) {
+      plan resolveOperatorsUp {
+        case p: Project if p.resolved && p.getTagValue(INSERT_COLLECT_METRICS_TAG).isEmpty =>

Review Comment:
   How about we remove AfterProject from the rule name and support more nodes in the future?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


Re: [PR] [KYUUBI #6024] Insert crc checksum observer after all project nodes [kyuubi]

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.
ulysses-you commented on code in PR #6025:
URL: https://github.com/apache/kyuubi/pull/6025#discussion_r1469478857


##########
extensions/spark/kyuubi-extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/observe/InsertChecksumObserverAfterProject.scala:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kyuubi.sql.observe
+
+import java.util.concurrent.atomic.AtomicLong
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast, Crc32, Expression, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{Count, Sum}
+import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
+import org.apache.spark.sql.types.{BinaryType, ByteType, DecimalType, IntegerType, LongType, ShortType, StringType}
+
+import org.apache.kyuubi.sql.KyuubiSQLConf.INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED
+import org.apache.kyuubi.sql.observe.InsertChecksumObserverAfterProject._
+
+case class InsertChecksumObserverAfterProject(session: SparkSession) extends Rule[LogicalPlan] {
+
+  private val INSERT_COLLECT_METRICS_TAG = TreeNodeTag[Unit]("__INSERT_COLLECT_METRICS_TAG")
+
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+    if (conf.getConf(INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED)) {
+      plan resolveOperatorsUp {
+        case p: Project if p.resolved && p.getTagValue(INSERT_COLLECT_METRICS_TAG).isEmpty =>

Review Comment:
   What happens if the plan is for writing ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org


Re: [PR] [KYUUBI #6024] Insert crc checksum observer after all project nodes [kyuubi]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #6025:
URL: https://github.com/apache/kyuubi/pull/6025#discussion_r1469503811


##########
extensions/spark/kyuubi-extension-spark-3-5/src/main/scala/org/apache/kyuubi/sql/observe/InsertChecksumObserverAfterProject.scala:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kyuubi.sql.observe
+
+import java.util.concurrent.atomic.AtomicLong
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast, Crc32, Expression, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{Count, Sum}
+import org.apache.spark.sql.catalyst.plans.logical.{CollectMetrics, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
+import org.apache.spark.sql.types.{BinaryType, ByteType, DecimalType, IntegerType, LongType, ShortType, StringType}
+
+import org.apache.kyuubi.sql.KyuubiSQLConf.INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED
+import org.apache.kyuubi.sql.observe.InsertChecksumObserverAfterProject._
+
+case class InsertChecksumObserverAfterProject(session: SparkSession) extends Rule[LogicalPlan] {
+
+  private val INSERT_COLLECT_METRICS_TAG = TreeNodeTag[Unit]("__INSERT_COLLECT_METRICS_TAG")
+
+  override def apply(plan: LogicalPlan): LogicalPlan = {
+    if (conf.getConf(INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED)) {
+      plan resolveOperatorsUp {
+        case p: Project if p.resolved && p.getTagValue(INSERT_COLLECT_METRICS_TAG).isEmpty =>

Review Comment:
   > What happens if the plan is for writing ?
   
   Just add observer after all project nodes, I think there will also be project node before writing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@kyuubi.apache.org
For additional commands, e-mail: notifications-help@kyuubi.apache.org