Posted to dev@iceberg.apache.org by Gautam <ga...@gmail.com> on 2020/01/28 20:21:43 UTC

Write reliability in Iceberg

Hello Devs,
                     We are currently building out a high-write-throughput
pipeline with Iceberg where hundreds or thousands of writers (and thousands
of readers) could be accessing a table at any given moment. We are hitting
the issue called out in [1]. According to Iceberg's spec on write
reliability [2], writers depend on an atomic swap and should retry the
commit if the swap fails. While that is true, there are cases where the
current write has the latest table state but still fails to perform the
swap, or worse, a reader sees an inconsistent table while the write is in
progress. To my understanding, this stems from the fact that the current
code [3] that performs the swap assumes the underlying filesystem provides
an atomic rename API (like HDFS et al.) for the version hint file, which
tracks the current version. If the filesystem does not provide this, the
commit fails with a fatal error. I think Iceberg should provide some
resiliency here by committing the version once it knows the latest table
state is still valid and, more importantly, by ensuring readers never fail
during a commit. If we agree, I can work on adding this to Iceberg.
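
To make the failure mode concrete, here is roughly the pattern the commit
path in [3] relies on. This is a simplified sketch with made-up names, not
the actual Iceberg code:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.exceptions.CommitFailedException;

// Simplified sketch of the two commit steps; metadataDir, nextVersion and
// tempMetadata are illustrative inputs standing in for the real table state.
class HadoopCommitSketch {
  void commit(FileSystem fs, Path metadataDir, int nextVersion, Path tempMetadata)
      throws IOException {
    // Step 1: claim the next version by renaming the already-written temp
    // metadata file into place. Only an atomic, fail-if-exists rename makes
    // concurrent writers safe here.
    Path finalMetadata = new Path(metadataDir, "v" + nextVersion + ".metadata.json");
    if (!fs.rename(tempMetadata, finalMetadata)) {
      throw new CommitFailedException("Version %s was created by a concurrent writer", nextVersion);
    }

    // Step 2: repoint readers at the new version. Create-with-overwrite is not
    // atomic on every filesystem, which is the window in which readers can see
    // a missing or truncated version-hint.text.
    Path versionHint = new Path(metadataDir, "version-hint.text");
    try (FSDataOutputStream out = fs.create(versionHint, true /* overwrite */)) {
      out.write(String.valueOf(nextVersion).getBytes(StandardCharsets.UTF_8));
    }
  }
}

On HDFS the rename in step 1 is atomic; on object stores it generally is
not, and step 2 is exactly where readers can get hurt.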

How are folks handling write/read consistency where the underlying filesystem
doesn't provide atomic APIs for file overwrite/rename? We've outlined the
details in the attached issue #758 [1]. What do folks think?

Cheers,
-Gautam.

[1] - https://github.com/apache/incubator-iceberg/issues/758
[2] - https://iceberg.incubator.apache.org/reliability/
[3] -
https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L220

Re: Write reliability in Iceberg

Posted by Ashish Mehta <me...@gmail.com>.
I commented on the issue. IIUC, the issue involves the behavior of the rename
API and, more specifically, the availability of the `version-hint.text` file:
it is rewritten with a `create with overwrite` filesystem call after the
rename of the new metadata file. The result is that readers hit an NPE while
some other process/writer is in the middle of committing a snapshot.

I wonder whether we should add retries on the reader side, but we also need
documentation about what the `create with overwrite` call expects from the
underlying file system in terms of guarantees.
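
Just to illustrate the reader-side retry idea (a rough sketch only; the
attempt count, backoff, and names are arbitrary and not taken from the
Iceberg code):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class VersionHintReader {
  // Retry reading version-hint.text a few times, since a concurrent commit can
  // briefly delete or truncate it on filesystems without an atomic overwrite.
  static int readCurrentVersion(FileSystem fs, Path versionHintFile) throws Exception {
    final int attempts = 5;
    for (int i = 0; i < attempts; i++) {
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(versionHintFile), StandardCharsets.UTF_8))) {
        return Integer.parseInt(in.readLine().trim());
      } catch (Exception e) { // e.g. FileNotFoundException or a blank line mid-commit
        if (i == attempts - 1) {
          throw e;
        }
        Thread.sleep(100L * (i + 1)); // simple linear backoff
      }
    }
    throw new IllegalStateException("unreachable");
  }
}

That only papers over the read path, though; it does nothing about the
writer-side race.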

-Ashish


Re: Write reliability in Iceberg

Posted by Gautam <ga...@gmail.com>.
Thanks, Ryan and Suds, for the suggestions; we are looking into these
options.

We currently don't have any external catalog or locking service and depend
purely on commit retries. Additionally, none of our metadata is in the Hive
Metastore, and we want to leverage the underlying filesystem to read the
table metadata, using the splittable nature of Iceberg's metadata.

I think that to keep split planning the way it's done today and still swap
metadata consistently, we would need to acquire/release a lock (using ZK or
otherwise) in our CustomTableOperations's *doCommit* implementation.
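
For what it's worth, here is a minimal sketch of what we have in mind,
assuming Curator for the ZooKeeper lock and the *doCommit* hook from the
custom catalog classes; the pointer helpers are hypothetical stand-ins for
our storage layout:

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.iceberg.BaseMetastoreTableOperations;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.exceptions.CommitFailedException;

// Hypothetical sketch: serialize the metadata pointer swap behind a ZooKeeper
// lock so only one writer can swap at a time on a filesystem without atomic
// rename. The abstract helpers stand in for whatever mechanism reads and
// updates the pointer to the current metadata file.
abstract class ZkLockedTableOperations extends BaseMetastoreTableOperations {
  private final CuratorFramework zk;  // connected Curator client
  private final String lockPath;      // e.g. "/iceberg/locks/<table>"

  ZkLockedTableOperations(CuratorFramework zk, String lockPath) {
    this.zk = zk;
    this.lockPath = lockPath;
  }

  @Override
  public void doCommit(TableMetadata base, TableMetadata metadata) {
    InterProcessMutex lock = new InterProcessMutex(zk, lockPath);
    try {
      if (!lock.acquire(30, TimeUnit.SECONDS)) {
        throw new CommitFailedException("Timed out acquiring ZooKeeper lock at %s", lockPath);
      }
      try {
        // Re-check under the lock that nobody committed since we refreshed,
        // then swap the pointer (first-commit/null handling omitted).
        if (!currentPointer().equals(pointerFor(base))) {
          throw new CommitFailedException("Table changed concurrently; refresh and retry");
        }
        swapPointer(metadata);
      } finally {
        lock.release();
      }
    } catch (CommitFailedException e) {
      throw e;
    } catch (Exception e) {
      throw new CommitFailedException("Commit failed while holding ZooKeeper lock: %s", e.getMessage());
    }
  }

  // Hypothetical, store-specific helpers.
  protected abstract String currentPointer();
  protected abstract String pointerFor(TableMetadata metadata);
  protected abstract void swapPointer(TableMetadata metadata);
}

The lock only helps if every writer goes through the same doCommit path, of
course.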

Thanks for the guidance,
-Gautam.


On Tue, Jan 28, 2020 at 2:55 PM Ryan Blue <rb...@netflix.com> wrote:

> Thanks for pointing out those references, suds!
>
> And thanks to Mouli (for writing the doc) and Anton (for writing the test)!
>
> On Tue, Jan 28, 2020 at 2:05 PM suds <su...@gmail.com> wrote:
>
>> We have referred to https://iceberg.incubator.apache.org/custom-catalog/ and
>> implemented the atomic operation using DynamoDB optimistic locking. The Iceberg
>> codebase has an excellent test case to validate a custom implementation.
>>
>> https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveTableConcurrency.java
>>
>>
>> On Tue, Jan 28, 2020 at 1:35 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi Gautam,
>>>
>>> Hadoop tables are not intended to be used when the file system doesn't
>>> support atomic rename because of the problems you describe. Atomic rename
>>> is a requirement for correctness in Hadoop tables.
>>>
>>> That is why we also have metastore tables, where some other atomic swap
>>> is used. I strongly recommend using a metastore-based solution when your
>>> underlying file system doesn't support atomic rename, like the Hive
>>> catalog. We've also made it easy to plug in your own metastore using the
>>> `BaseMetastore` classes.
>>>
>>> That said, if you have an idea to make Hadoop tables better, I'm all for
>>> getting it in. But version hint file aside, without atomic rename, two
>>> committers could still conflict and cause one of the commits to be dropped
>>> because the second one to create any particular version's metadata file may
>>> succeed. I don't see a way around this.
>>>
>>> If you don't want to use a metastore, then you could rely on a write
>>> lock provided by ZooKeeper or something similar.
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Write reliability in Iceberg

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for pointing out those references, suds!

And thanks to Mouli (for writing the doc) and Anton (for writing the test)!

On Tue, Jan 28, 2020 at 2:05 PM suds <su...@gmail.com> wrote:

> We have referred to https://iceberg.incubator.apache.org/custom-catalog/ and
> implemented the atomic operation using DynamoDB optimistic locking. The Iceberg
> codebase has an excellent test case to validate a custom implementation.
>
> https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveTableConcurrency.java
>
>
> On Tue, Jan 28, 2020 at 1:35 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi Gautam,
>>
>> Hadoop tables are not intended to be used when the file system doesn't
>> support atomic rename because of the problems you describe. Atomic rename
>> is a requirement for correctness in Hadoop tables.
>>
>> That is why we also have metastore tables, where some other atomic swap
>> is used. I strongly recommend using a metastore-based solution when your
>> underlying file system doesn't support atomic rename, like the Hive
>> catalog. We've also made it easy to plug in your own metastore using the
>> `BaseMetastore` classes.
>>
>> That said, if you have an idea to make Hadoop tables better, I'm all for
>> getting it in. But version hint file aside, without atomic rename, two
>> committers could still conflict and cause one of the commits to be dropped
>> because the second one to create any particular version's metadata file may
>> succeed. I don't see a way around this.
>>
>> If you don't want to use a metastore, then you could rely on a write lock
>> provided by ZooKeeper or something similar.
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Write reliability in Iceberg

Posted by suds <su...@gmail.com>.
We have referred to https://iceberg.incubator.apache.org/custom-catalog/ and
implemented the atomic operation using DynamoDB optimistic locking. The Iceberg
codebase has an excellent test case to validate a custom implementation:
https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveTableConcurrency.java
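
The core of the commit ends up being a conditional write, roughly like the
sketch below (the pointer table, key, and attribute names here are made up
for illustration, not taken from the doc):

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.exceptions.CommitFailedException;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

class DynamoPointerSwap {
  // Optimistic-locking swap of the table's metadata pointer: the write only
  // succeeds if the stored pointer still matches what this committer read when
  // it started; otherwise another writer won and the commit should be retried.
  static void swap(DynamoDbClient dynamo, String tableName,
                   String expectedLocation, String newLocation) {
    Map<String, AttributeValue> item = new HashMap<>();
    item.put("table_name", AttributeValue.builder().s(tableName).build());
    item.put("metadata_location", AttributeValue.builder().s(newLocation).build());

    try {
      dynamo.putItem(PutItemRequest.builder()
          .tableName("iceberg_table_pointers") // hypothetical pointer table
          .item(item)
          .conditionExpression(
              "attribute_not_exists(table_name) OR metadata_location = :expected")
          .expressionAttributeValues(Map.of(
              ":expected", AttributeValue.builder().s(expectedLocation).build()))
          .build());
    } catch (ConditionalCheckFailedException e) {
      throw new CommitFailedException("Concurrent update to %s; retry the commit", tableName);
    }
  }
}

A failed conditional write just means another committer won the race, so the
normal Iceberg retry loop can pick it up from there.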


On Tue, Jan 28, 2020 at 1:35 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Gautam,
>
> Hadoop tables are not intended to be used when the file system doesn't
> support atomic rename because of the problems you describe. Atomic rename
> is a requirement for correctness in Hadoop tables.
>
> That is why we also have metastore tables, where some other atomic swap is
> used. I strongly recommend using a metastore-based solution when your
> underlying file system doesn't support atomic rename, like the Hive
> catalog. We've also made it easy to plug in your own metastore using the
> `BaseMetastore` classes.
>
> That said, if you have an idea to make Hadoop tables better, I'm all for
> getting it in. But version hint file aside, without atomic rename, two
> committers could still conflict and cause one of the commits to be dropped
> because the second one to create any particular version's metadata file may
> succeed. I don't see a way around this.
>
> If you don't want to use a metastore, then you could rely on a write lock
> provided by ZooKeeper or something similar.
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Write reliability in Iceberg

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Hi Gautam,

Hadoop tables are not intended to be used when the file system doesn't
support atomic rename because of the problems you describe. Atomic rename
is a requirement for correctness in Hadoop tables.

That is why we also have metastore tables, where some other atomic swap is
used. I strongly recommend using a metastore-based solution, like the Hive
catalog, when your underlying file system doesn't support atomic rename.
We've also made it easy to plug in your own metastore using the
`BaseMetastore` classes.
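
A rough skeleton of that plug-in point, going by the custom-catalog doc (the
store-specific helpers here are hypothetical):

import org.apache.iceberg.BaseMetastoreTableOperations;
import org.apache.iceberg.TableMetadata;

// Hypothetical skeleton of a custom metastore-backed table. The two hooks to
// implement are doRefresh() (load the current metadata pointer from your
// store) and doCommit() (atomically swap the pointer, e.g. with a conditional
// write or a lock). Everything store-specific is left abstract.
abstract class MyMetastoreTableOperations extends BaseMetastoreTableOperations {

  @Override
  public void doRefresh() {
    String metadataLocation = loadPointerFromMetastore();  // hypothetical helper
    refreshFromMetadataLocation(metadataLocation);         // provided by the base class
  }

  @Override
  public void doCommit(TableMetadata base, TableMetadata metadata) {
    String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1);
    // Atomic compare-and-swap of the pointer in the metastore; must fail with
    // CommitFailedException if another writer got there first.
    swapPointerInMetastore(currentMetadataLocation(), newMetadataLocation);
  }

  // Hypothetical, store-specific helpers.
  protected abstract String loadPointerFromMetastore();
  protected abstract void swapPointerInMetastore(String expected, String updated);
}

doCommit() should fail with a CommitFailedException when the pointer swap
loses a race, so the caller's retry logic can refresh and try again.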

That said, if you have an idea to make Hadoop tables better, I'm all for
getting it in. But the version hint file aside, without atomic rename, two
committers could still conflict and cause one of the commits to be dropped,
because the second writer to create any particular version's metadata file
may succeed and silently replace the first. I don't see a way around this.

If you don't want to use a metastore, then you could rely on a write lock
provided by ZooKeeper or something similar.



-- 
Ryan Blue
Software Engineer
Netflix