
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/19 17:48:14 UTC

[GitHub] [iceberg] maximethebault opened a new issue, #6224: Spark: regression / query failure with Iceberg 1.0.0 and UNION

maximethebault opened a new issue, #6224:
URL: https://github.com/apache/iceberg/issues/6224

   ### Apache Iceberg version
   
   1.0.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   After upgrading to Iceberg 1.0.0 & Spark 3.3.1 (from 0.13.x & 3.2.x), some of our SQL queries stopped working.
   
   We suspect it may be an Iceberg-related issue, as we couldn't reproduce it with Hive tables.
   
   ### Stripped-down reproducer
   
   Set-up tables & views
   ```
   val table1 = Seq(("204")).toDF("id")
   table1.createOrReplaceTempView("table1")
   
   val table2_1 = Seq(("204")).toDF("id")
   table2_1.writeTo("dev.table2_1").using("iceberg").createOrReplace()
   
   val table2_2 = Seq(("204")).toDF("id")
   table2_2.createOrReplaceTempView("table2_2")
   
   val table2 = spark.table("dev.table2_1").union(spark.table("table2_2"))
   table2.createOrReplaceTempView("table2")
   ```
   
   Run query
   ```
   SELECT 
           u.*
       FROM 
           table1
       LEFT JOIN
           (
           SELECT 
               id
           FROM 
               table1
           LEFT JOIN
               table2
           USING(id)
           ) u 
       USING(id)
   ```
   
   Results in an exception:
   
   ```
   java.lang.IllegalArgumentException: requirement failed
     at scala.Predef$.require(Predef.scala:268)
     at org.apache.spark.sql.catalyst.plans.logical.View.<init>(basicLogicalOperators.scala:569)
     at org.apache.spark.sql.catalyst.plans.logical.View.copy(basicLogicalOperators.scala:568)
     at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildInternal(basicLogicalOperators.scala:604)
     at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildInternal(basicLogicalOperators.scala:565)
     at org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1242)
     at org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1240)
     at org.apache.spark.sql.catalyst.plans.logical.View.withNewChildrenInternal(basicLogicalOperators.scala:565)
     at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:462)
     at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
     at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:461)
     at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.org$apache$spark$sql$catalyst$analysis$Analyzer$AddMetadataColumns$$addMetadataCol(Analyzer.scala:975)
     at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$addMetadataCol$1(Analyzer.scala:975)
   ```
   
   ### Further investigation
   
   If I replace "USING" with classical "ON" clauses, the exception is not thrown.
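   
   For reference, a sketch of what that `ON` variant of the query above might look like. It assumes both `USING(id)` clauses are replaced; the aliases `t`, `t1`, and `t2` are introduced here for illustration only and do not appear in the original query.
   
   ```sql
   -- Same join shape as the failing query, with explicit ON conditions.
   -- Aliases t, t1, t2 are hypothetical, added to disambiguate id.
   SELECT
       u.*
   FROM
       table1 t
   LEFT JOIN
       (
       SELECT
           t1.id
       FROM
           table1 t1
       LEFT JOIN
           table2 t2
       ON t1.id = t2.id
       ) u
   ON t.id = u.id
   ```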
   
   I think this issue is caused by the fact I'm mixing Iceberg & non-Iceberg tables in the UNION clause.
   
   If I inline table2 in the query, I get a different exception:
   
   ```
   SELECT 
       u.*
   FROM 
       table1
   LEFT JOIN
       (
       SELECT 
           id
       FROM 
           table1
       LEFT JOIN
           ((SELECT id id FROM dev.table2_1 limit 1) UNION (SELECT id FROM table2_2))
       USING(id)
       ) u 
   USING(id)
   ```
   
   results in:
   
   ```
   org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 1 columns;
   'Project [id#1302]
   +- 'Project [id#1302, id#1302]
      +- 'Project [id#1302, id#998]
         +- 'Join LeftOuter, (id#998 = id#1302)
            :- SubqueryAlias table1
            :  +- View (`table1`, [id#998])
            :     +- Project [value#995 AS id#998]
            :        +- LocalRelation [value#995]
            +- 'SubqueryAlias u
               +- 'Project [id#1294, id#1302]
                  +- 'Project [id#1294, id#1302]
                     +- 'Join LeftOuter, (id#1302 = id#1294)
                        :- SubqueryAlias table1
                        :  +- View (`table1`, [id#1302])
                        :     +- Project [value#1296 AS id#1302]
                        :        +- LocalRelation [value#1296]
                        +- 'SubqueryAlias __auto_generated_subquery_name
                           +- 'Distinct
                              +- 'Union false, false
                                 :- GlobalLimit 1
                                 :  +- LocalLimit 1
                                 :     +- Project [_spec_id#1297, _partition#1298, _file#1299, _pos#1300L, _deleted#1301, id#1295 AS id#1294]
                                 :        +- SubqueryAlias spark_catalog.dev.table2_1
                                 :           +- RelationV2[id#1295, _spec_id#1297, _partition#1298, _file#1299, _pos#1300L, _deleted#1301] spark_catalog.dev.table2_1
                                 +- Project [id#1011]
                                    +- SubqueryAlias table2_2
                                       +- View (`table2_2`, [id#1011])
                                          +- Project [value#1008 AS id#1011]
                                             +- LocalRelation [value#1008]
   ```
   
   It looks like some Iceberg metadata columns are visible to Spark during query analysis, and I'm not sure they are supposed to be.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] maximethebault commented on issue #6224: Spark: regression / query failure with Iceberg 1.0.0 and UNION

Posted by GitBox <gi...@apache.org>.

maximethebault commented on issue #6224:
URL: https://github.com/apache/iceberg/issues/6224#issuecomment-1356353314

   Thanks for investigating this issue further!
   I'll go ahead and close this issue since it isn't Iceberg-related. I'll make sure to keep an eye on the Spark issue you created.



[GitHub] [iceberg] shardulm94 commented on issue #6224: Spark: regression / query failure with Iceberg 1.0.0 and UNION

Posted by GitBox <gi...@apache.org>.

shardulm94 commented on issue #6224:
URL: https://github.com/apache/iceberg/issues/6224#issuecomment-1356083642

   Hey @maximethebault!
   Thanks for the report. I investigated this and found that it is actually a bug in Spark 3.3.1+. I have created [SPARK-41557](https://issues.apache.org/jira/browse/SPARK-41557) against the Spark project to track this.



[GitHub] [iceberg] maximethebault closed issue #6224: Spark: regression / query failure with Iceberg 1.0.0 and UNION

Posted by GitBox <gi...@apache.org>.

maximethebault closed issue #6224: Spark: regression / query failure with Iceberg 1.0.0 and UNION
URL: https://github.com/apache/iceberg/issues/6224

