
[GitHub] [iceberg] cccs-eric opened a new issue, #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14

cccs-eric opened a new issue, #5719:
URL: https://github.com/apache/iceberg/issues/5719

   ### Apache Iceberg version
   
   0.14.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I have a Spark job that used to run fine until I upgraded my stack to Spark 3.3.0/Iceberg 0.14. The job had been running flawlessly, but since the upgrade I get an error (`org.apache.iceberg.shaded.org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!`) ~80% of the time. The Spark job does the following:
   
   1. It starts by reading a bunch of files from a location
   2. It processes those files and writes the output to an Iceberg table partitioned by day and a boolean attribute. The job runs daily, so to be able to reprocess a past day it writes to the table using `overwritePartitions` (a minimal sketch of this write follows step 3).
   3. Once writing is done, it calls
   ```
   # partition_date is the day being processed
   CALL {catalog}.system.rewrite_data_files(
       table => '{namespace}.{table}',
       options => map('target-file-size-bytes', '54525952'),
       where => 'timeperiod_loadedBy = "{partition_date.strftime("%Y-%m-%d")}"'
   )
   ```
   and then
   ```
   CALL {catalog}.system.expire_snapshots(
       table => '{namespace}.{table}',
       retain_last => 1,
       older_than => TIMESTAMP '{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
       max_concurrent_deletes => 10,
       stream_results => true)
   ```
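   
   For context, here is a minimal PySpark sketch of how such a daily run could be driven (editor's illustration, not the reporter's actual code; `my_catalog`, `my_namespace.my_table` and the helper name are placeholders):
   ```
   from datetime import date, datetime
   from pyspark.sql import DataFrame, SparkSession
   
   def run_daily(spark: SparkSession, df: DataFrame, partition_date: date) -> None:
       # Step 2: dynamically replace the partitions covered by this day's data,
       # so a past day can be reprocessed safely.
       df.writeTo("my_catalog.my_namespace.my_table").overwritePartitions()
   
       # Step 3: compact the day's files, then expire old snapshots.
       spark.sql(f"""
           CALL my_catalog.system.rewrite_data_files(
               table => 'my_namespace.my_table',
               options => map('target-file-size-bytes', '54525952'),
               where => 'timeperiod_loadedBy = "{partition_date.strftime("%Y-%m-%d")}"')
       """)
       spark.sql(f"""
           CALL my_catalog.system.expire_snapshots(
               table => 'my_namespace.my_table',
               retain_last => 1,
               older_than => TIMESTAMP '{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
               max_concurrent_deletes => 10,
               stream_results => true)
       """)
   ```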
   The error happens during `expire_snapshots`, while doing a `broadcast exchange` stage.  I'm attaching a screenshot showing a successful run:
   ![stages](https://user-images.githubusercontent.com/56447460/188884274-3e8ee4fc-75ed-4795-a3e3-454d44de9ec2.png)
   
   Since the job overwrites the partition, I can re-run it when it fails and it will "eventually" work. Could it be related to a timing/concurrency issue?
   
   Here is the stack trace:
   ```
   22/09/06 13:14:19 WARN TaskSetManager: Lost task 370.0 in stage 7.0 (TID 51769) (10.244.104.4 executor 4): org.apache.iceberg.shaded.org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
       at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:224)
       at org.apache.iceberg.avro.AvroIterable$AvroReuseIterator.hasNext(AvroIterable.java:192)
       at org.apache.iceberg.io.CloseableIterable$4$1.hasNext(CloseableIterable.java:108)
       at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
       at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
       at org.apache.iceberg.io.CloseableIterable$4$1.hasNext(CloseableIterable.java:108)
       at org.apache.iceberg.io.CloseableIterator$2.hasNext(CloseableIterator.java:71)
       at org.apache.iceberg.io.ClosingIterator.hasNext(ClosingIterator.java:40)
       at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
       at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
       at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
       at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
       at org.apache.spark.scheduler.Task.run(Task.scala:136)
       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       at java.base/java.lang.Thread.run(Thread.java:829)
   Caused by: java.io.IOException: Invalid sync!
       at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:319)
       at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:213)
       ... 27 more
   ```




[GitHub] [iceberg] cccs-eric closed issue #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14

cccs-eric closed issue #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14
URL: https://github.com/apache/iceberg/issues/5719




[GitHub] [iceberg] cccs-eric commented on issue #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14

cccs-eric commented on issue #5719:
URL: https://github.com/apache/iceberg/issues/5719#issuecomment-1254846679

   The problem is related to the hadoop-azure library, not to Iceberg. A bug was introduced in version 3.3.2. I have yet to report it.
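   
   If the prefetch path does turn out to be the culprit, one possible mitigation to experiment with (editor's note, not part of the original report; `fs.azure.readaheadqueue.depth` is the ABFS read-ahead setting, please verify it against the hadoop-azure version in use) would be turning the read-ahead queue off for the maintenance jobs:
   ```
   from pyspark.sql import SparkSession
   
   # Hypothetical workaround sketch: with the read-ahead queue depth set to 0,
   # ABFS serves reads with plain, in-order requests instead of the
   # multi-threaded prefetcher suspected here.
   spark = (
       SparkSession.builder
       .appName("iceberg-maintenance")
       .config("spark.hadoop.fs.azure.readaheadqueue.depth", "0")
       .getOrCreate()
   )
   ```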




[GitHub] [iceberg] nastra commented on issue #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14

nastra commented on issue #5719:
URL: https://github.com/apache/iceberg/issues/5719#issuecomment-1239470510

   In terms of Avro versions, between Iceberg 0.13.2 & 0.14.0 we only upgraded from Avro **1.10.1** to **1.10.2**, and I don't see anything suspicious in https://github.com/apache/avro/compare/release-1.10.1...release-1.10.2.
   
   Avro-related changes that made it into 0.14.0 are:
   * https://github.com/apache/iceberg/commit/013d4ad612204beeeb39fd2d6d84d15afecb8c47
   * https://github.com/apache/iceberg/commit/13f0a7102c49b3d75bdd0323fbabd76fa3b2feb1
   * https://github.com/apache/iceberg/commit/d397ae3f1b88e9a8bcd5d687e701c2d04f130b67
   * https://github.com/apache/iceberg/commit/16158218464f6808f1c703962af435459b3ace73
   * https://github.com/apache/iceberg/commit/da63c843991431af374c4b9b5530719d5b80f49e
   but I don't see anything that stands out. Did you by any chance update anything else (table/catalog properties, compression/ ...)?




[GitHub] [iceberg] nastra commented on issue #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14

nastra commented on issue #5719:
URL: https://github.com/apache/iceberg/issues/5719#issuecomment-1239379412

   @cccs-eric can you please share which versions of Iceberg and Spark you were using previously, when things were still working?




[GitHub] [iceberg] cccs-eric commented on issue #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14

cccs-eric commented on issue #5719:
URL: https://github.com/apache/iceberg/issues/5719#issuecomment-1239649434

   >   Did you by any chance update anything else (table/catalog properties, compression/ ...)?
   
   Not that I'm aware of. It seems like this workload is the only one that has this problem, and we have many others that do `rewrite_data_files`, so maybe this job does something different. When a `rewrite_data_files` with bin packing runs, does Iceberg concatenate Avro (manifest) files? In my case, the data is in Parquet and the metadata is in Avro.
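   
   As an aside, one quick way to check whether the referenced manifests change between runs is Iceberg's `manifests` metadata table (editor's sketch; the catalog and table names are placeholders):
   ```
   # Each Iceberg table exposes metadata tables; `manifests` lists the Avro
   # manifest files referenced by the current snapshot, so the output can be
   # compared before and after a rewrite_data_files / expire_snapshots run.
   spark.sql(
       "SELECT path, length, added_snapshot_id "
       "FROM my_catalog.my_namespace.my_table.manifests"
   ).show(truncate=False)
   ```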
   
   




[GitHub] [iceberg] cccs-eric commented on issue #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14

cccs-eric commented on issue #5719:
URL: https://github.com/apache/iceberg/issues/5719#issuecomment-1248037880

   @nastra I have been troubleshooting this issue and I now know more about what is going on.  Here are the latest facts:
   
   1. `{catalog}.system.expire_snapshots` is the problematic call, not `{catalog}.system.rewrite_data_files`. I can reproduce the problem by calling only `expire_snapshots` in a loop, so the manifest files are not changing between runs. Out of, say, 10 loop iterations I'll get 1 or more failures, which means the manifest files are valid and the problem occurs "randomly" (a minimal sketch of the loop follows this list).
   2. My Iceberg tables are stored in Azure Data Lake (Gen2, DFS) and Spark uses the hadoop-azure package to read/write from the data lake. I had a good look at the hadoop-azure code and it does multi-threaded pre-fetching (lookaheads) when reading files. To me this looks suspicious: what appears to be a random failure could very well be a concurrency problem.
   3. I was able to prove that moving to Spark 3.3.0 (from 3.2.1) and Iceberg 0.14 caused the problem since I can run my test just fine (85 consecutive successful runs) using the previous Spark build (3.2.1-0.13.1), against the same table/same files.  As soon as I switch to the new build, the error occurs within a few runs.
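   
   The loop test from point 1 could look roughly like this (editor's sketch, not the reporter's actual script; catalog and table names are placeholders):
   ```
   from datetime import datetime
   from pyspark.sql import SparkSession
   
   def hammer_expire_snapshots(spark: SparkSession, runs: int = 10) -> None:
       # Repeatedly expire snapshots against the same, unchanged table.
       # On the Spark 3.3.0 / Iceberg 0.14 stack a run occasionally fails with
       # "Invalid sync!" while the action reads the Avro manifest files.
       for i in range(runs):
           try:
               spark.sql(f"""
                   CALL my_catalog.system.expire_snapshots(
                       table => 'my_namespace.my_table',
                       retain_last => 1,
                       older_than => TIMESTAMP '{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
                       max_concurrent_deletes => 10,
                       stream_results => true)
               """)
               print(f"run {i}: ok")
           except Exception as exc:
               print(f"run {i}: FAILED - {exc}")
   ```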
   
   I think the problem is related to hadoop-azure and NOT Iceberg, but I have yet to prove it. The Spark upgrade to 3.3.0 includes an upgrade of the hadoop-azure library (from 3.3.1 to 3.3.2), and some modifications to the prefetch code were made in the 3.3.2 release. I have yet to go through all those changes; I was hoping I could isolate and reproduce the problem without Spark, using only the pure Java API. Which brings me to the following question:
   
   - When calling [Table.expireSnapshots()](https://github.com/apache/iceberg/blob/0f0b1af2ebbc5693ff6dc8049a1e3490540311f8/core/src/main/java/org/apache/iceberg/BaseTable.java#L213), I realized that not a single Avro file is read from the data lake. But when the same [Spark action](https://github.com/apache/iceberg/blob/0f0b1af2ebbc5693ff6dc8049a1e3490540311f8/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java#L49-L63) is called, all the Avro manifest files are read, and this is where the problem happens. Don't get me wrong, reading the manifests should not trigger a problem, but I'd like to understand why the Spark implementation differs from the pure Java one.




[GitHub] [iceberg] cccs-eric commented on issue #5719: Avro invalid sync since upgrade to Spark 3.3.0 and Iceberg 0.14

cccs-eric commented on issue #5719:
URL: https://github.com/apache/iceberg/issues/5719#issuecomment-1239391661

   @nastra Spark 3.2.1 and Iceberg 0.13.2. We also upgraded to Java 11 (from Java 8).

