Posted to commits@hudi.apache.org by "liwei (Jira)" <ji...@apache.org> on 2020/06/06 16:20:00 UTC

[jira] [Commented] (HUDI-944) Support more complete concurrency control when writing data

    [ https://issues.apache.org/jira/browse/HUDI-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127395#comment-17127395 ] 

liwei commented on HUDI-944:
----------------------------

Thanks so much [~vinoth]

I completely agree with you.

First, I also think (a) is very practical for many scenarios; the only disadvantage is that it may generate a large number of partitions. I am very happy to take up (a). Can I start by adding more tests to HUDI-839? :)

Second, I think (b) is valuable for Hudi. Today users run the Hudi client on Spark to write large amounts of data, relying on Spark's distributed processing. But this approach needs a Spark cluster, which is complicated in some scenarios. Some users need a lightweight solution that just lets plain clients write concurrently.

Third, I have some rough ideas about (b). “inserts. i.e two transactions inserting same records, only one of them should succeed.” We have run into this scenario too. Some databases solve this problem with bucketing or sharding. With bucketing, users first bucket their data by key using a hash partitioning algorithm (Kafka has such an algorithm built in); different Hudi clients then write data for different keys and do not conflict when writing concurrently. The shortcoming is that this relies on users bucketing the data before writing to Hudi. But I think this solution may still make sense, because Hudi is a storage format for now and has no service that hashes incoming data and then writes it to Hudi concurrently. Is https://issues.apache.org/jira/browse/HUDI-55 relevant? :)
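
To make the bucketing idea concrete, here is a rough sketch of what the user-side step could look like. It is only an illustration: BucketRouter, bucketFor and split are hypothetical names and not part of Hudi, and CRC32 is just a stand-in for whatever stable hash is chosen (Kafka's default partitioner uses a different hash). Each resulting bucket would be handed to its own Hudi client, so no two writers ever touch the same record key.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical helper, not a Hudi API: routes record keys to writers by hash.
public class BucketRouter {

    // Maps a record key to a bucket in [0, numBuckets).
    // CRC32 is an assumption; any hash that is stable across processes works.
    public static int bucketFor(String recordKey, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(recordKey.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder fits in an int.
        return (int) (crc.getValue() % numBuckets);
    }

    // Splits keys into one list per bucket; the same key always lands in the
    // same bucket, so duplicate inserts go to the same writer.
    public static List<List<String>> split(List<String> recordKeys, int numBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : recordKeys) {
            buckets.get(bucketFor(key, numBuckets)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("order-1", "order-2", "order-3", "order-1");
        List<List<String>> buckets = split(keys, 2);
        for (int i = 0; i < buckets.size(); i++) {
            // In a real job each bucket would go to a separate Hudi client.
            System.out.println("writer " + i + " -> " + buckets.get(i));
        }
    }
}
```

Since both copies of "order-1" hash to the same bucket, the duplicate insert is seen by a single writer, which can resolve it locally instead of racing with another client.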

 

Thank you very much,

liwei

> Support more complete  concurrency control when writing data
> ------------------------------------------------------------
>
>                 Key: HUDI-944
>                 URL: https://issues.apache.org/jira/browse/HUDI-944
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: liwei
>            Assignee: liwei
>            Priority: Major
>             Fix For: 0.6.0
>
>
> Hudi currently supports concurrency control only between writes and compaction. But some scenarios need concurrency control between writers, for example two Spark jobs with different data sources that need to write to the same Hudi table.
> I have two proposals:
> 1. First step: support write concurrency control across different partitions.
>  Currently, when two clients write data to different partitions, they hit these errors:
> a. Rolling back commits failed
> b. Instant version already exists
> {code:java}
>  [2020-05-25 21:20:34,732] INFO Checking for file exists ?/tmp/HudiDLATestPartition/.hoodie/20200525212031.clean.inflight (org.apache.hudi.common.table.timeline.HoodieActiveTimeline)
>  Exception in thread "main" org.apache.hudi.exception.HoodieIOException: Failed to create file /tmp/HudiDLATestPartition/.hoodie/20200525212031.clean
>  at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:437)
>  at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:327)
>  at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionCleanInflightToComplete(HoodieActiveTimeline.java:290)
>  at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:183)
>  at org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:142)
>  at org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  {code}
> c. The two clients' archiving conflicts
> d. The read client hits "Unable to infer schema for Parquet. It must be specified manually.;"
> 2. Second step: support insert, upsert, and compaction concurrency control at different isolation levels, such as Serializable and WriteSerializable.
> Hudi can design a mechanism to check for conflicts in AbstractHoodieWriteClient.commit()
>  


