
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/04/26 13:06:55 UTC

[GitHub] [incubator-iceberg] vanliu-tx opened a new issue #970: Writing HadoopTable concurrently leading to losing data

vanliu-tx opened a new issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970

I wrote some tests that write to one HadoopTable concurrently.

In the first case, 30 threads in one Java process write to a HadoopTable; each thread commits 100000 records and then sleeps for a while. In the last metadata file, the total record count is less than what I committed.

After digging around, I found that this issue is addressed by https://github.com/apache/incubator-iceberg/pull/754

However, when I started two separate Java processes writing data into one HadoopTable, the data loss problem was still there.

In this case, I started 20 threads in the first Java process and 15 threads in the second; each thread committed 10 times, with 100000 records in each commit. After the test finished, there were only 3150000 records in the table, fewer than the expected record count of 3500000.

Concurrent commits to a HadoopTable from different processes are still a problem.

Here is the metadata folder; please check the last metadata file, v351.metadata.json.

![image](https://user-images.githubusercontent.com/64360028/80308459-c8987180-8801-11ea-9810-ca62d182874a.png)

[0423_meta.zip](https://github.com/apache/incubator-iceberg/files/4535350/0423_meta.zip)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

rdblue commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-621364689


   @vanliu-tx, are you sure that there weren't any commit failures? With so many concurrent writes, I wouldn't be surprised if commits exceeded the maximum number of retries and failed. Do you have logs to check this?



[GitHub] [incubator-iceberg] vanliu-tx commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

vanliu-tx commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-624619670


   @rdblue, I just ran my test case (one process with 50 threads committing data files concurrently) with the latest main branch; however, it failed.
   
   Some commits are lost as there were 500 commits in total, but only 485 metadata files. See the attached metadata folder zip file.
   [050605_metadata.zip](https://github.com/apache/incubator-iceberg/files/4586652/050605_metadata.zip)
   
   There are some NPEs in the log file, and the exception count (16) matches the number of missing commits.
   ```
   Failed to notify listeners (org.apache.iceberg.SnapshotProducer:308)
   java.lang.NullPointerException
   	at org.apache.iceberg.MergingSnapshotProducer.updateEvent(MergingSnapshotProducer.java:325)
   	at org.apache.iceberg.SnapshotProducer.notifyListeners(SnapshotProducer.java:303)
   	at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:298)
   	at com.tencent.bkdata.iceberg.StreamDemo$1.run(StreamDemo.java:191)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   ```
   
   [run.log](https://github.com/apache/incubator-iceberg/files/4586658/run.log)
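
The comparison between expected commits and metadata files can be scripted. The following is only a sketch, assuming the metadata folder has been copied to a local path and that v1 is written at table creation (as stated elsewhere in this thread); the class and method names are hypothetical and not part of Iceberg. For HDFS, the listing would have to go through the Hadoop FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper: counts vN.metadata.json files in a local copy of the metadata folder.
class MetadataFileCount {
  private static final Pattern VERSION_FILE = Pattern.compile("v\\d+\\.metadata\\.json");

  // Number of version files found directly under the metadata directory.
  static long countVersionFiles(Path metadataDir) throws IOException {
    try (Stream<Path> files = Files.list(metadataDir)) {
      return files
          .map(p -> p.getFileName().toString())
          .filter(name -> VERSION_FILE.matcher(name).matches())
          .count();
    }
  }

  // Assumption: v1 is written when the table is created,
  // so N data commits should leave N + 1 version files.
  static long missingCommits(long expectedCommits, long versionFiles) {
    return expectedCommits + 1 - versionFiles;
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 2) {
      System.out.println("usage: MetadataFileCount <metadata-dir> <expected-commits>");
      return;
    }
    Path dir = Paths.get(args[0]);
    long expected = Long.parseLong(args[1]);
    long found = countVersionFiles(dir);
    System.out.println("version files: " + found
        + ", missing commits: " + missingCommits(expected, found));
  }
}
```

Under that assumption, 500 expected commits and 485 version files give 500 + 1 - 485 = 16 missing commits, which lines up with the 16 exceptions in the log.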
   



[GitHub] [incubator-iceberg] vanliu-tx commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

vanliu-tx commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-620331086


   @aokolnychyi, I'm testing on hadoop-2.6.0-cdh5.4.1 deployed on 6 Linux (2.6.32.43 kernel) hosts.



[GitHub] [incubator-iceberg] vanliu-tx commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

vanliu-tx commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-621652911


   The zip file 0423_meta.zip is in the first post, just below the picture.



[GitHub] [incubator-iceberg] vanliu-tx commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

vanliu-tx commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-624408226


   I was on vacation for the last five days; I will test it today and let you know the result.



[GitHub] [incubator-iceberg] rdblue commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

rdblue commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-621990745


   @vanliu-tx, can you try the PR I just posted?



[GitHub] [incubator-iceberg] rdblue commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

rdblue commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-621365796


   Another thing you can do is turn up the [number of retries](https://iceberg.apache.org/configuration/#table-behavior-properties) on the table and re-run:
   
   ```java
   table.updateProperties()
       .set("commit.retry.num-retries", "360")
       .commit();
   ```



[GitHub] [incubator-iceberg] rdblue edited a comment on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

rdblue edited a comment on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-621365796


   Another thing you can do is turn up the [number of retries](https://iceberg.apache.org/configuration/#table-behavior-properties) on the table and re-run:
   
   ```java
   table.updateProperties()
       .set("commit.retry.num-retries", "360")
       .commit();
   ```



[GitHub] [incubator-iceberg] vanliu-tx commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

vanliu-tx commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-621651729


   @rdblue, in my test case I raised the number of retries to 20 so that threads would not fail to commit data. One process has 20 threads and the other has 15; each thread commits 10 times. So there are 350 data commits in total, which matches the vX.metadata.json file count (v1 to v351; v1 is for creating the table) in the attached zip file (0423_meta.zip).
   
   ```java
   Map<String, String> props = ImmutableMap.of(
       TableProperties.DEFAULT_FILE_FORMAT, FILE_FORMAT,
       TableProperties.COMMIT_NUM_RETRIES, "20");
   ```
   
   As per https://github.com/apache/incubator-iceberg/pull/754, I think there is a chance that one thread in one process enters commit and passes the base != current() check, and then another thread in another process commits and refreshes the current metadata. That way, the version is updated while the first thread commits, which in turn breaks the atomic rename operation because nextVersion will point to a non-existent file. The current fix only synchronizes the commit action within one process; it can't synchronize across multiple processes. @aokolnychyi
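
For comparison, here is a minimal sketch of a versioned commit in which conflict detection rests on a single atomic create-if-absent step, so it does not depend on any in-process lock. This is not how HadoopTableOperations is implemented; it uses `java.nio.file.Files.createFile` on a local filesystem, and all class and method names are hypothetical. Whether the rename on a given HDFS deployment offers an equivalent guarantee is a separate question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

// Hypothetical local-filesystem sketch of an optimistic, versioned commit.
class OptimisticVersionCommit {
  // Highest N for which vN.metadata.json exists, or 0 for an empty directory.
  static int currentVersion(Path dir) throws IOException {
    try (Stream<Path> files = Files.list(dir)) {
      return files
          .map(p -> p.getFileName().toString())
          .filter(n -> n.matches("v\\d+\\.metadata\\.json"))
          .mapToInt(n -> Integer.parseInt(n.substring(1, n.indexOf('.'))))
          .max()
          .orElse(0);
    }
  }

  // Tries to publish version base + 1. Returns false if another writer got there first.
  static boolean tryCommit(Path dir, int base, String payload) throws IOException {
    Path target = dir.resolve("v" + (base + 1) + ".metadata.json");
    try {
      // Existence check and creation happen as one atomic operation.
      Files.createFile(target);
    } catch (FileAlreadyExistsException e) {
      return false;
    }
    // Simplification: a real implementation would prepare the content
    // elsewhere first so that readers never see an empty version file.
    Files.write(target, payload.getBytes(StandardCharsets.UTF_8));
    return true;
  }

  // Re-reads the current version before every attempt and retries on conflict.
  static int commitWithRetry(Path dir, String payload, int maxRetries) throws IOException {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      int base = currentVersion(dir);
      if (tryCommit(dir, base, payload)) {
        return base + 1;
      }
    }
    throw new IllegalStateException("commit failed after " + maxRetries + " retries");
  }

  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("metadata");
    int threads = 4;
    int commitsPerThread = 5;
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      final String writer = "writer-" + t;
      pool.submit(() -> {
        for (int i = 0; i < commitsPerThread; i++) {
          commitWithRetry(dir, writer + " commit " + i, 100);
        }
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
    System.out.println("final version: " + currentVersion(dir));
  }
}
```

In this sketch a writer that loses the race gets `false` back and must rebuild its commit on top of the new version, rather than overwriting a version that another writer has already published.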
   
   Here is an example in which commit 9 (v10) overrides commit 8 (v9) because the two commits share the same parent-snapshot-id, so 100000 records are lost from commit 9 onward. Please see the metadata diff in the attached text file.
   
   [v8_v9_v10_diff.txt](https://github.com/apache/incubator-iceberg/files/4556623/v8_v9_v10_diff.txt)
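
The same symptom can be searched for across the whole metadata folder. Below is a sketch, assuming the current snapshot id and its parent-snapshot-id have already been extracted from each vN.metadata.json file (JSON parsing is left out); `Snap` and `forkedParents` are hypothetical names. In a linear history every parent has exactly one child, so any parent with two or more children marks a place where one commit was built on a stale base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds parent snapshots that more than one commit was based on.
class SharedParentCheck {
  // Minimal stand-in for the snapshot id / parent-snapshot-id pair of one commit.
  static final class Snap {
    final long id;
    final Long parentId; // null for the first snapshot

    Snap(long id, Long parentId) {
      this.id = id;
      this.parentId = parentId;
    }
  }

  // Returns parent id -> child ids for every parent with two or more children.
  static Map<Long, List<Long>> forkedParents(List<Snap> snapshots) {
    Map<Long, List<Long>> children = new LinkedHashMap<>();
    for (Snap s : snapshots) {
      if (s.parentId != null) {
        children.computeIfAbsent(s.parentId, k -> new ArrayList<>()).add(s.id);
      }
    }
    // Keep only the parents that were used as a base more than once.
    children.values().removeIf(ids -> ids.size() < 2);
    return children;
  }

  public static void main(String[] args) {
    // Made-up ids for illustration: snapshots 3 and 4 both claim snapshot 2 as parent.
    List<Snap> history = Arrays.asList(
        new Snap(1L, null), new Snap(2L, 1L), new Snap(3L, 2L), new Snap(4L, 2L));
    System.out.println(forkedParents(history));
  }
}
```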
   
   



[GitHub] [incubator-iceberg] aokolnychyi commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-620239535


   @vanliu-tx, are you testing this on your local file system? What properties does a rename operation have in your environment?



[GitHub] [incubator-iceberg] aokolnychyi commented on issue #970: Writing HadoopTable concurrently leading to losing data

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on issue #970:
URL: https://github.com/apache/incubator-iceberg/issues/970#issuecomment-621541933


   I'd also check if all commits were successful as Ryan said. This can be done by checking the number of snapshots before and after the test. Then we can be sure that we committed the expected number of snapshots. It may also make sense to tune other retry configs such as min/max wait to improve the performance.
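
To get a feel for how min/max wait interact with the retry count, here is a rough sketch, assuming retries back off exponentially (doubling each time) from a minimum wait up to a maximum wait. The class, the doubling factor, and the example values are assumptions for illustration and are not taken from the Iceberg source.

```java
// Hypothetical model of a capped exponential backoff between commit retries.
class RetryBackoff {
  // Wait before retry number `attempt` (0-based):
  // minWaitMs doubled once per earlier attempt, capped at maxWaitMs.
  static long waitMs(int attempt, long minWaitMs, long maxWaitMs) {
    if (attempt < 0) {
      throw new IllegalArgumentException("attempt must be >= 0");
    }
    long wait = minWaitMs;
    // Stop doubling once the cap is reached, which also avoids overflow.
    for (int i = 0; i < attempt && wait < maxWaitMs; i++) {
      wait *= 2;
    }
    return Math.min(wait, maxWaitMs);
  }

  // Total time spent waiting if all `retries` retries are used.
  static long totalWaitMs(int retries, long minWaitMs, long maxWaitMs) {
    long total = 0;
    for (int attempt = 0; attempt < retries; attempt++) {
      total += waitMs(attempt, minWaitMs, maxWaitMs);
    }
    return total;
  }

  public static void main(String[] args) {
    // Example values only: 100 ms minimum wait, 60 s maximum wait, 20 retries.
    for (int attempt = 0; attempt < 20; attempt++) {
      System.out.println("retry " + attempt + ": " + waitMs(attempt, 100, 60_000) + " ms");
    }
    System.out.println("total: " + totalWaitMs(20, 100, 60_000) + " ms");
  }
}
```

With many writers contending for the same table, a small minimum wait makes the first few retries almost immediate, so raising it (or the retry count) is one way to trade latency for fewer exhausted retries.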

