Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/24 09:22:16 UTC

[GitHub] [iceberg] dotjdk opened a new issue, #5625: Structured Streaming writes to iceberg table with non-identity partition spec breaks with spark extensions enabled

dotjdk opened a new issue, #5625:
URL: https://github.com/apache/iceberg/issues/5625

   ### Apache Iceberg version
   
   0.14.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I am running a spark structured streaming job reading data from Kafka and writing to an Iceberg table partitioned by `days(timestamp)`.
   
   When `IcebergSparkSessionExtensions` are enabled, my job fails with `org.apache.spark.sql.AnalysisException: days(timestamp) ASC NULLS FIRST is not currently supported`.
   
   The only way I can get it to work is by not registering `IcebergSparkSessionExtensions` and enabling `fanout-writer`. When I do that, the data is written to the table, but I get the following entry in the log:
   
   ```
   2022-08-19 06:02:55 WARN  [stream execution thread for Streaming Query [id = 9996dced-e80f-43b6-b241-0533f4df934c, runId = 6b4caf31-db34-4cf1-b88e-8794b49c3a6a]]  o.a.i.spark.source.SparkWriteBuilder - Skipping distribution/ordering: extensions are disabled and spec contains unsupported transforms
   ```
   
   When I enable `IcebergSparkSessionExtensions`, I get the `AnalysisException` quoted above whether or not `fanout-writer` is enabled.
   
   I couldn't find a test case that triggers this with non-identity partitioning, so I have attached a patch file with a modified version of the `TestStructuredStreaming` test case. It is parameterized over fanout enabled/disabled and extensions registered/not registered, with these results:
   
   | **Extensions** | **fanout-writer** | **Result**                                                                          |
   |----------------|-------------------|-------------------------------------------------------------------------------------|
   | disabled       | enabled           | Pass                                                                                |
   | disabled       | disabled          | Fail: Encountered records that belong to already closed files                       |
   | enabled        | enabled           | Fail: AnalysisException: days(timestamp) ASC NULLS FIRST is not currently supported |
   | enabled        | disabled          | Fail: AnalysisException: days(timestamp) ASC NULLS FIRST is not currently supported |
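
One plausible reading of the first two rows: with extensions disabled the sort by partition is skipped (see the `Skipping distribution/ordering` warning above), so a writer that keeps only one partition open at a time can receive a partition it has already closed. The sketch below is a toy model of that behavior, not Iceberg's implementation; `ClusteredWriter`, `FanoutWriter`, `days` and `run` are hypothetical names, and only the error text is taken from the table.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days(ts):
    """Toy stand-in for days(timestamp): whole days since the Unix epoch (UTC)."""
    return (ts - EPOCH).days


class ClusteredWriter:
    """Keeps one partition open at a time, so input must arrive grouped by partition."""

    def __init__(self):
        self.current = None   # partition whose file is currently open
        self.closed = set()   # partitions whose files were already closed
        self.files = {}

    def write(self, ts, value):
        part = days(ts)
        if part != self.current:
            if part in self.closed:
                raise RuntimeError(
                    "Encountered records that belong to already closed files")
            if self.current is not None:
                self.closed.add(self.current)
            self.current = part
        self.files.setdefault(part, []).append(value)


class FanoutWriter:
    """Keeps a file open for every partition seen, so input order does not matter."""

    def __init__(self):
        self.files = {}

    def write(self, ts, value):
        self.files.setdefault(days(ts), []).append(value)


def run(writer, records):
    """Write all records and report the outcome in the style of the table above."""
    try:
        for ts, value in records:
            writer.write(ts, value)
        return "Pass"
    except RuntimeError as err:
        return f"Fail: {err}"


def utc(day, hour):
    return datetime(2022, 8, day, hour, tzinfo=timezone.utc)


# Unsorted micro-batch: Aug 19, then Aug 20, then Aug 19 again.
batch = [(utc(19, 6), 1), (utc(20, 6), 2), (utc(19, 7), 3)]

print(run(FanoutWriter(), batch))
print(run(ClusteredWriter(), batch))
print(run(ClusteredWriter(), sorted(batch, key=lambda r: days(r[0]))))
```

In this model the clustered writer only succeeds once the batch is sorted by the transformed partition value, which is the step that appears to be skipped when extensions are disabled and rejected by the analyzer when they are enabled.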
   
   Patch file with testcase:
   [Non-identity_partitioning_broken_without_fanout_writer.patch.zip](https://github.com/apache/iceberg/files/9414772/Non-identity_partitioning_broken_without_fanout_writer.patch.zip)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dotjdk commented on issue #5625: Structured Streaming writes to iceberg table with non-identity partition spec breaks with spark extensions enabled

Posted by GitBox <gi...@apache.org>.
dotjdk commented on issue #5625:
URL: https://github.com/apache/iceberg/issues/5625#issuecomment-1230004083

   I have created a PR with the patch applied (slightly modified) https://github.com/apache/iceberg/pull/5660




[GitHub] [iceberg] github-actions[bot] commented on issue #5625: Structured Streaming writes to iceberg table with non-identity partition spec breaks with spark extensions enabled

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #5625:
URL: https://github.com/apache/iceberg/issues/5625#issuecomment-1620849800

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.




[GitHub] [iceberg] sbernauer commented on issue #5625: Structured Streaming writes to iceberg table with non-identity partition spec breaks with spark extensions enabled

Posted by GitBox <gi...@apache.org>.
sbernauer commented on issue #5625:
URL: https://github.com/apache/iceberg/issues/5625#issuecomment-1276703232

   Hi @dotjdk, thanks for bringing this up!
   I'm running into the exact same issue (`days(timestamp) ASC NULLS FIRST is not currently supported`) with the following PySpark structured streaming job.
   Versions: Spark 3.3.0, Iceberg 0.14.1.
   ```
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("write-iceberg-table").getOrCreate()
   
   spark.sql("CREATE SCHEMA IF NOT EXISTS warehouse.bug LOCATION 's3a://warehouse/bug/'")
   spark.sql("CREATE TABLE IF NOT EXISTS warehouse.bug.bug3 (timestamp timestamp, value bigint) USING iceberg PARTITIONED BY (days(timestamp))")
   
   df = spark \
     .readStream \
     .format("rate") \
     .load()
   
   query = df \
       .writeStream \
       .format("iceberg") \
       .outputMode("append") \
       .trigger(processingTime='2 minutes') \
       .option("path", "warehouse.bug.bug3") \
       .option("checkpointLocation", "s3a://warehouse/bug/bug3/checkpoints") \
       .start()
   
   query.awaitTermination()
   ```
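
The repro above does not show how the session itself was configured. As a hedged sketch, the helpers below build the settings that would toggle the extensions on or off for a catalog named `warehouse`; `session_conf` and `apply_conf` are hypothetical helpers, and the catalog type and warehouse location are deliberately left out because the thread does not state them.

```python
ICEBERG_EXTENSIONS = (
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
)


def session_conf(catalog="warehouse", extensions_enabled=True):
    """Return Spark conf entries for an Iceberg catalog, optionally with extensions.

    Assumption: the catalog is an Iceberg SparkCatalog. Its type and warehouse
    path are not given in the thread, so they are not set here.
    """
    conf = {
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
    }
    if extensions_enabled:
        conf["spark.sql.extensions"] = ICEBERG_EXTENSIONS
    return conf


def apply_conf(builder, conf):
    """Apply each entry through builder.config(key, value) and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage with PySpark (not executed here):
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("write-iceberg-table")
#   spark = apply_conf(builder, session_conf(extensions_enabled=True)).getOrCreate()
```

Running the job once with `extensions_enabled=True` and once with `False` would correspond to the "Extensions" column in the original report.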




[GitHub] [iceberg] sbernauer commented on issue #5625: Structured Streaming writes to iceberg table with non-identity partition spec breaks with spark extensions enabled

Posted by GitBox <gi...@apache.org>.
sbernauer commented on issue #5625:
URL: https://github.com/apache/iceberg/issues/5625#issuecomment-1329342005

   Re-tested with Iceberg 1.1.0; the problem persists.




[GitHub] [iceberg] tvamsikalyan commented on issue #5625: Structured Streaming writes to iceberg table with non-identity partition spec breaks with spark extensions enabled

Posted by GitBox <gi...@apache.org>.
tvamsikalyan commented on issue #5625:
URL: https://github.com/apache/iceberg/issues/5625#issuecomment-1331642092

   One workaround that avoids the problematic code path is to set the write option **use-table-distribution-and-ordering** to **false**.
   For example, the following worked with the repro steps above:
   
   `df.writeStream \
   .format("iceberg") \
   .outputMode("append") \
   .trigger(processingTime='5 seconds') \
   .option("path", "local.db.table") \
   .option("checkpointLocation", "/tmp/warehouse/checkpoints") \
   .option("use-table-distribution-and-ordering", "false") \
   .start()`
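
For comparing runs with and without the workaround, the options can be kept in a dict and applied in a loop. This is a minimal sketch: `write_options` and `with_options` are hypothetical helpers, and `fanout-enabled` is assumed to be the write option behind what the original report calls `fanout-writer`.

```python
def write_options(use_table_distribution=False, fanout=True):
    """Build Iceberg write options as strings, the form .option() accepts."""
    return {
        # Workaround from this thread: skip the table's distribution/ordering.
        "use-table-distribution-and-ordering": str(use_table_distribution).lower(),
        # Assumption: without the sort, incoming rows are not clustered by
        # partition, so the fanout writer is likely needed as well.
        "fanout-enabled": str(fanout).lower(),
    }


def with_options(writer, options):
    """Call writer.option(key, value) for each entry and return the writer."""
    for key, value in options.items():
        writer = writer.option(key, value)
    return writer


# Usage with PySpark (not executed here); `df` is the streaming DataFrame
# from the repro:
#   writer = df.writeStream.format("iceberg").outputMode("append")
#   query = with_options(writer, write_options()).start()
```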

