You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/06/26 06:44:00 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #7910: Spark: Inject `DataSourceV2Relation` when missing

Fokko opened a new pull request, #7910:
URL: https://github.com/apache/iceberg/pull/7910

   When you start a structured streaming query using `.start()`, there will be no `DataSourceV2Relation` reference. When this is missing, we'll just create one since we know the table.
   
   Resolves #7226


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

Re: [PR] Spark: Inject `DataSourceV2Relation` when missing [iceberg]

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko closed pull request #7910: Spark: Inject `DataSourceV2Relation` when missing
URL: https://github.com/apache/iceberg/pull/7910


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #7910: Spark: Inject `DataSourceV2Relation` when missing

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on code in PR #7910:
URL: https://github.com/apache/iceberg/pull/7910#discussion_r1313113936


##########
spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/SetMissingRelation.scala:
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.execution.streaming.sources.WriteToMicroBatchDataSource
+
+/**
+ * When running structured streaming using `.start()` instead of `.toTable(...)`,
+ * the relation is empty, and the catalog functions will be missing.
+ * The catalog functions are required when a table is partitioned with a transform.
+ */
+case class SetMissingRelation(spark: SparkSession) extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan match {
+    case p: WriteToMicroBatchDataSource if p.relation.isEmpty =>
+      import spark.sessionState.analyzer.CatalogAndIdentifier
+      val originalMultipartIdentifier = spark.sessionState.sqlParser
+        .parseMultipartIdentifier(p.table.name())

Review Comment:
   I see your concern, but as you can see in the test it works quite well. In the smoke test it passes for `table`, `catalog.schema.table`, and with `s3://bucket/wh/path`. Fixing it in Spark could also work, but then I need more pointers on where to start. I looked into the comment on the issue, but it wasn't directly obvious to me. I think we should fix this, looking at how many people are bumping into this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

Re: [PR] Spark: Inject `DataSourceV2Relation` when missing [iceberg]

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #7910:
URL: https://github.com/apache/iceberg/pull/7910#discussion_r1355941817


##########
spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/SetMissingRelation.scala:
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.execution.streaming.sources.WriteToMicroBatchDataSource
+
+/**
+ * When running structured streaming using `.start()` instead of `.toTable(...)`,
+ * the relation is empty, and the catalog functions will be missing.
+ * The catalog functions are required when a table is partitioned with a transform.
+ */
+case class SetMissingRelation(spark: SparkSession) extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan match {
+    case p: WriteToMicroBatchDataSource if p.relation.isEmpty =>
+      import spark.sessionState.analyzer.CatalogAndIdentifier
+      val originalMultipartIdentifier = spark.sessionState.sqlParser
+        .parseMultipartIdentifier(p.table.name())

Review Comment:
   I'd say let's just use the new API in Spark and don't worry about it. I think you already updated the docs to cover that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #7910: Spark: Inject `DataSourceV2Relation` when missing

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on PR #7910:
URL: https://github.com/apache/iceberg/pull/7910#issuecomment-1670103008

   @Marcus-Rosti I first wanted to get some feedback on this before also backporting this to older versions of Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #7910: Spark: Inject `DataSourceV2Relation` when missing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #7910:
URL: https://github.com/apache/iceberg/pull/7910#issuecomment-1687286139

   It seems like we are trying to fix a bug in Spark, which is beyond Iceberg control. While I don't mind that cause we would have to wait for a new Spark release otherwise, it would be nice to look for a proper fix in Spark that would work without Iceberg extensions.
   
   Let me take a closer look at the Spark side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #7910: Spark: Inject `DataSourceV2Relation` when missing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #7910:
URL: https://github.com/apache/iceberg/pull/7910#issuecomment-1681413388

   I should have some time to review this week. Sorry for the delay!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7910: Spark: Inject `DataSourceV2Relation` when missing

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #7910:
URL: https://github.com/apache/iceberg/pull/7910#discussion_r1300829017


##########
spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/SetMissingRelation.scala:
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.execution.streaming.sources.WriteToMicroBatchDataSource
+
+/**
+ * When running structured streaming using `.start()` instead of `.toTable(...)`,
+ * the relation is empty, and the catalog functions will be missing.
+ * The catalog functions are required when a table is partitioned with a transform.
+ */
+case class SetMissingRelation(spark: SparkSession) extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan match {
+    case p: WriteToMicroBatchDataSource if p.relation.isEmpty =>
+      import spark.sessionState.analyzer.CatalogAndIdentifier
+      val originalMultipartIdentifier = spark.sessionState.sqlParser
+        .parseMultipartIdentifier(p.table.name())

Review Comment:
   I have doubts that this is generally safe as we assume `table.name()` would include the catalog name. That's currently the case but it feels fragile. I mentioned a potential fix on the Spark side [here](https://github.com/apache/iceberg/issues/7226#issuecomment-1687307732). Let me come back with fresh eyes tomorrow.
   
   @Fokko, let me know what you think on fixing it in Spark.



##########
spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/optimizer/SetMissingRelation.scala:
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.execution.streaming.sources.WriteToMicroBatchDataSource
+
+/**
+ * When running structured streaming using `.start()` instead of `.toTable(...)`,
+ * the relation is empty, and the catalog functions will be missing.
+ * The catalog functions are required when a table is partitioned with a transform.
+ */
+case class SetMissingRelation(spark: SparkSession) extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan match {
+    case p: WriteToMicroBatchDataSource if p.relation.isEmpty =>
+      import spark.sessionState.analyzer.CatalogAndIdentifier
+      val originalMultipartIdentifier = spark.sessionState.sqlParser
+        .parseMultipartIdentifier(p.table.name())

Review Comment:
   I have doubts that this is generally safe as we assume `table.name()` would include the catalog name. That's currently the case but it feels fragile. I mentioned a potential fix on the Spark side [here](https://github.com/apache/iceberg/issues/7226#issuecomment-1687307732). Let me come back with fresh eyes tomorrow.
   
   @Fokko, let me know what you think about fixing it in Spark.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Marcus-Rosti commented on pull request #7910: Spark: Inject `DataSourceV2Relation` when missing

Posted by "Marcus-Rosti (via GitHub)" <gi...@apache.org>.

Marcus-Rosti commented on PR #7910:
URL: https://github.com/apache/iceberg/pull/7910#issuecomment-1670026924

   Is this fix only for 3.4+?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org