Posted to dev@parquet.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2017/12/07 00:20:13 UTC

Iceberg table format

Hi everyone,

I mentioned in the sync-up this morning that I’d send out an introduction
to the table format we’re working on, which we’re calling Iceberg.

For anyone that wasn’t around, here’s the background: there are several
problems with how we currently manage data files to make up a table in the
Hadoop ecosystem. The one that came up today was that you can’t actually
update a table atomically to, for example, rewrite a file and safely delete
records. That’s because Hive tables track what files are currently visible
by listing partition directories, and we don’t have (or want) transactions
for changes in Hadoop file systems. This means that you can’t actually have
isolated commits to a table and the result is that *query results from Hive
tables can be wrong*, though rarely in practice.

The problems with current tables are caused primarily by keeping the state
of which files are, and are not, in a table in the file system itself. As I
said, one problem is that there are no transactions, but you also have to
list directories to plan jobs (bad on S3) and rename files from a temporary
location to a final location (really, really bad on S3).

To avoid these problems, we’ve been building the Iceberg format, which tracks
every file in a table instead of tracking directories. Iceberg maintains
snapshots of all the files in a dataset and atomically swaps snapshots and
other metadata to commit. There are a few benefits to doing it this way (a
rough sketch follows the list):

   - *Snapshot isolation*: Readers always use a consistent snapshot of the
   table, without needing to hold a lock. All updates are atomic.
   - *O(1) RPCs to plan*: Instead of listing O(n) directories in a table to
   plan a job, reading a snapshot requires O(1) RPC calls
   - *Distributed planning*: File pruning and predicate push-down are
   distributed to jobs, removing the metastore bottleneck
   - *Version history and rollback*: Table snapshots are kept around, and it
   is possible to roll back if a buggy job commits bad data
   - *Finer granularity partitioning*: Distributed planning and O(1) RPC
   calls remove the current barriers to finer-grained partitioning
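
To make that concrete, here is a rough sketch in Java. None of these class
or field names come from the actual library or spec; they are invented only
to show how planning against a snapshot's file list replaces directory
listing.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only; these are not the real Iceberg classes.
    class DataFile {
      final String path;        // location of the Parquet/Avro data file
      final String partition;   // partition value recorded in the snapshot
      final long recordCount;   // per-file metrics kept with the snapshot

      DataFile(String path, String partition, long recordCount) {
        this.path = path;
        this.partition = partition;
        this.recordCount = recordCount;
      }
    }

    class Snapshot {
      // The complete list of files in the table at one point in time.
      // A reader loads this once instead of listing O(n) directories.
      final List<DataFile> files = new ArrayList<>();
    }

    class ScanPlanner {
      // Planning is just filtering the in-memory file list by partition.
      static List<DataFile> plan(Snapshot snapshot, String wantedPartition) {
        List<DataFile> selected = new ArrayList<>();
        for (DataFile file : snapshot.files) {
          if (file.partition.equals(wantedPartition)) {
            selected.add(file);
          }
        }
        return selected;
      }

      public static void main(String[] args) {
        Snapshot snapshot = new Snapshot();
        snapshot.files.add(new DataFile("s3://bucket/t/a.parquet", "2017-12-06", 1000));
        snapshot.files.add(new DataFile("s3://bucket/t/b.parquet", "2017-12-07", 2000));
        System.out.println(ScanPlanner.plan(snapshot, "2017-12-07").size() + " file(s) selected");
      }
    }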

We’re also taking this opportunity to fix a few other problems:

   - Schema evolution: columns are tracked by ID to support add/drop/rename
   (see the sketch after this list)
   - Types: a core set of types, thoroughly tested to work consistently
   across all of the supported data formats
   - Metrics: cost-based optimization metrics are kept in the snapshots
   - Portable spec: tables should not be tied to Java and should have a
   simple and clear specification for other implementers
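
To illustrate the schema-evolution point above, here is a tiny sketch. The
class and method names are made up, not the real API: because data files
reference columns by ID, a rename only changes the display name and a
dropped ID is never reused.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: a column carries a permanent ID plus a mutable name.
    class Column {
      final int id;   // assigned once, never changes
      String name;    // safe to rename; data files refer to the ID

      Column(int id, String name) {
        this.id = id;
        this.name = name;
      }
    }

    class TableSchema {
      private final Map<Integer, Column> columnsById = new LinkedHashMap<>();
      private int nextId = 1;

      int addColumn(String name) {
        int id = nextId++;            // a re-added name still gets a new ID
        columnsById.put(id, new Column(id, name));
        return id;
      }

      void renameColumn(int id, String newName) {
        columnsById.get(id).name = newName;   // identity (the ID) is untouched
      }

      void dropColumn(int id) {
        columnsById.remove(id);       // the ID is retired, not reused
      }

      public static void main(String[] args) {
        TableSchema schema = new TableSchema();
        int id = schema.addColumn("event_ts");
        schema.renameColumn(id, "event_time");  // old files still resolve by ID
      }
    }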

We have finished the core library that tracks files, along with most of a
specification, and a Spark datasource (v2) that can read Iceberg tables.
I’ll be working on the write path next, and we plan to build a Presto
implementation soon.

I think this should be useful to others and it would be great to
collaborate with anyone that is interested.

rb
​
-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg table format

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
The reason this came up was the discussion about how to delete data. Delete
markers don't really fit in the file format layer, but they need to go
somewhere. I think it makes sense to handle them at the table level.

Iceberg is a separate project, but one that a lot of people on this list
are probably interested in. The table format uses Avro and Parquet as file
formats to start with.

rb

On Fri, Dec 8, 2017 at 10:49 AM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> Very interesting. I was wondering how this is related to Parquet. Are we
> talking about an abstraction on top of Parquet which allows capabilities
> like versioning, snapshotting, etc.? Or a completely different format (maybe
> API-compatible with Parquet)?
>
> - Rahul
>
> On Fri, Dec 8, 2017 at 8:42 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I'm working on getting the code out to our open source github org,
>> probably
>> early next week. I'll set up a mailing list for it as well.
>>
>> rb
>>
>> On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>> > Sounds super interesting. Would love to collaborate on this. Do you
>> have a
>> > repo or mailing list where you are working on this?
>> >
>> >
>> >
>> > On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <rb...@netflix.com.invalid>
>> > wrote:
>> >> [original announcement snipped]
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg table format

Posted by rahul challapalli <ch...@gmail.com>.
Very interesting. I was wondering how this is related to Parquet. Are we
talking about an abstraction on top of Parquet which allows capabilities
like versioning, snapshotting, etc.? Or a completely different format (maybe
API-compatible with Parquet)?

- Rahul

On Fri, Dec 8, 2017 at 8:42 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> I'm working on getting the code out to our open source github org, probably
> early next week. I'll set up a mailing list for it as well.
>
> rb
>
> On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > Sounds super interesting. Would love to collaborate on this. Do you have
> a
> > repo or mailing list where you are working on this?
> >
> >
> >
> > On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >> [original announcement snipped]
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Iceberg table format

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I've also created a google group for discussion:
iceberg-devel@googlegroups.com

You can join the group here:
https://groups.google.com/forum/#!forum/iceberg-devel

rb

On Thu, Jan 4, 2018 at 11:03 AM, Ryan Blue <rb...@netflix.com> wrote:

> Just made it public a minute ago. The repo is here: https://github.com/
> Netflix/iceberg
>
> It's built with gradle and requires a Spark 2.3.0-SNAPSHOT (for Datasource
> V2) and Parquet 1.9.1-SNAPSHOT (for API additions and bug fixes).
>
> An early version of the spec is available for comments here:
> https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_
> Q8Qf0ctMyGBKslOswA/edit?usp=sharing
>
> Feedback is definitely welcome.
>
> rb
>
> On Wed, Jan 3, 2018 at 6:28 PM, Julien Le Dem <ju...@wework.com>
> wrote:
>
>> Happy new year!
>> I'm interested as well.
>> Did you get to publish your code on github?
>> Thanks
>>
>> On Fri, Dec 8, 2017 at 8:42 AM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> I'm working on getting the code out to our open source github org,
>>> probably
>>> early next week. I'll set up a mailing list for it as well.
>>>
>>> rb
>>>
>>> On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <ja...@apache.org>
>>> wrote:
>>>
>>> > Sounds super interesting. Would love to collaborate on this. Do you
>>> have a
>>> > repo or mailing list where you are working on this?
>>> >
>>> >
>>> >
>>> > On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <rb...@netflix.com.invalid>
>>> > wrote:
>>> >> [original announcement snipped]
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg table format

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Just made it public a minute ago. The repo is here:
https://github.com/Netflix/iceberg

It's built with gradle and requires a Spark 2.3.0-SNAPSHOT (for Datasource
V2) and Parquet 1.9.1-SNAPSHOT (for API additions and bug fixes).
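
For anyone who wants to try the read path, it should look roughly like the
usual Spark DataSource V2 pattern. The source name and table path below are
placeholders I made up; check the repo for how the datasource is actually
registered and how tables are located.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadIcebergTable {
      public static void main(String[] args) {
        // Run with spark-submit, which supplies the master and the Iceberg jars.
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-read-example")
            .getOrCreate();

        // "iceberg" and the path are placeholders, not from the spec.
        Dataset<Row> df = spark.read()
            .format("iceberg")
            .load("hdfs://namenode/path/to/iceberg/table");

        df.show();
      }
    }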

An early version of the spec is available for comments here:
https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit?usp=sharing

Feedback is definitely welcome.

rb

On Wed, Jan 3, 2018 at 6:28 PM, Julien Le Dem <ju...@wework.com>
wrote:

> Happy new year!
> I'm interested as well.
> Did you get to publish your code on github?
> Thanks
>
> On Fri, Dec 8, 2017 at 8:42 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> I'm working on getting the code out to our open source github org,
>> probably
>> early next week. I'll set up a mailing list for it as well.
>>
>> rb
>>
>> On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>
>> > Sounds super interesting. Would love to collaborate on this. Do you
>> have a
>> > repo or mailing list where you are working on this?
>> >
>> >
>> >
>> > On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <rb...@netflix.com.invalid>
>> > wrote:
>> >> [original announcement snipped]
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg table format

Posted by Julien Le Dem <ju...@wework.com>.
Happy new year!
I'm interested as well.
Did you get to publish your code on github?
Thanks

On Fri, Dec 8, 2017 at 8:42 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> I'm working on getting the code out to our open source github org, probably
> early next week. I'll set up a mailing list for it as well.
>
> rb
>
> On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > Sounds super interesting. Would love to collaborate on this. Do you have
> a
> > repo or mailing list where you are working on this?
> >
> >
> >
> > On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >> [original announcement snipped]
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Iceberg table format

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I'm working on getting the code out to our open source github org, probably
early next week. I'll set up a mailing list for it as well.

rb

On Thu, Dec 7, 2017 at 6:38 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Sounds super interesting. Would love to collaborate on this. Do you have a
> repo or mailing list where you are working on this?
>
>
>
> On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>> [original announcement snipped]


-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg table format

Posted by Jacques Nadeau <ja...@apache.org>.
Sounds super interesting. Would love to collaborate on this. Do you have a
repo or mailing list where you are working on this?



On Wed, Dec 6, 2017 at 4:20 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> [original announcement snipped]

Re: Iceberg table format

Posted by Atri Sharma <at...@gmail.com>.
Thanks.

Sorry for the brevity, I was on vacation and sending emails from my phone.

My main idea there was that since the proposed architecture is not
tied to Parquet in any manner, we can go ahead and allow other file
formats to hook in using an API.

Can you please share the code and spec so that we can start thinking
about extending it (specifically around deletes)?

Regards,

Atri

On Tue, Dec 12, 2017 at 12:55 AM, Ryan Blue <rb...@netflix.com> wrote:
> [quoted reply snipped]



-- 
Regards,

Atri
l'apprenant

Re: Iceberg table format

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
On Sat, Dec 9, 2017 at 3:38 AM, Atri Sharma <at...@gmail.com> wrote:

> Thanks for the specification.
>
> A couple of questions:
>
> 1) What ties this to Parquet and not to any underlying store?
> 2) If the above is not true, can we expose an interface to plug in any
> underlying file format?
> 3) If we are defining snapshots, can we allow MVCC on top of the snapshots?
>
> To elaborate on 3): I would like to see a fully transactional file format
> that allows us to be generic and performant.
>
> I would be interested in writing a specification for updates in this format.
> Can you please share the repository link and some internal documents to
> understand a bit more?
>

The format uses a form of MVCC to ensure readers always use a consistent
snapshot of the table without blocking writers. New versions show up
atomically.

However, this doesn't use a "transactional file format". It uses an atomic
operation to replace a table's current metadata using immutable files. This
is necessary for compatibility with file systems like HDFS and S3 where the
files must be stored.

I'm not sure what you mean by questions 1 or 2. Parquet is a data file
format used in Iceberg tables. Avro is also allowed so that the tables
support a write-optimized format (Avro) and a read-optimized format
(Parquet). File formats are tracked on a per-file basis.
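
To illustrate that last point, here is a made-up sketch (not the real
metadata layout): each tracked file carries its own format, so one table can
mix write-optimized Avro with read-optimized Parquet and a reader is chosen
per file.

    // Illustrative only; the real manifest entries hold more than this.
    enum FileFormat { AVRO, PARQUET }

    class ManifestEntry {
      final String path;
      final FileFormat format;   // tracked per file, not per table
      final long recordCount;

      ManifestEntry(String path, FileFormat format, long recordCount) {
        this.path = path;
        this.format = format;
        this.recordCount = recordCount;
      }
    }

    class Readers {
      // The reader is picked from the entry itself, not from table settings.
      static String readerFor(ManifestEntry entry) {
        switch (entry.format) {
          case AVRO:    return "avro-reader";
          case PARQUET: return "parquet-reader";
          default:      throw new IllegalArgumentException("unknown format");
        }
      }

      public static void main(String[] args) {
        ManifestEntry entry =
            new ManifestEntry("s3://bucket/t/part-0.avro", FileFormat.AVRO, 500);
        System.out.println(Readers.readerFor(entry));
      }
    }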

rb

Re: Iceberg table format

Posted by Atri Sharma <at...@gmail.com>.
Thanks for the specification.

A couple of questions:

1) What ties this to Parquet and not to any underlying store?
2) If the above is not true, can we expose an interface to plug in any
underlying file format?
3) If we are defining snapshots, can we allow MVCC on top of the snapshots?

To elaborate on 3): I would like to see a fully transactional file format
that allows us to be generic and performant.

I would be interested in writing a specification for updates in this format.
Can you please share the repository link and some internal documents to
understand a bit more?

Regards,
Atri
On 7 Dec 2017 05:50, "Ryan Blue" <rb...@netflix.com.invalid> wrote:

[original announcement snipped]

Re: Iceberg table format

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Replies inline.

On Sat, Dec 9, 2017 at 2:00 AM, Julian Hyde <jh...@apache.org> wrote:

> Since you’re dealing with change (atomic updates and so forth) are we
> still just talking about a *format* or does the architecture also now
> include pieces that could be described as *servers* and *protocols*?
>

We're talking primarily about a format. Iceberg defines a way to maintain
table contents in metadata files. The scheme requires some atomic operation
to change the latest metadata file, which I've prototyped with an atomic
rename in HDFS. We are going to use an atomic swap (check-and-put) in our
metastore for the purpose. Otherwise, it's a format.
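
Here is a rough sketch of the HDFS prototype mentioned above; the metadata
file naming and the helper are invented for illustration, not taken from the
spec. The property it relies on is that an HDFS rename is atomic and will
not replace an existing file, so only one concurrent committer can claim a
given version.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative only: commit a new table-metadata file by renaming it
    // into place; a false return means someone else committed first.
    class HdfsCommitSketch {
      static boolean commit(Path tableDir, Path tempMetadata, long newVersion,
                            Configuration conf) throws IOException {
        FileSystem fs = tableDir.getFileSystem(conf);
        Path target = new Path(tableDir, "v" + newVersion + ".metadata.json");
        // On HDFS, rename is atomic and fails if the target file exists.
        return fs.rename(tempMetadata, target);
      }
    }

The check-and-put variant in a metastore is the same idea: update the pointer
to the current metadata file only if it still matches the version the writer
started from, and re-read and retry if it does not.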


> When I first spoke to the CarbonData folks I observed that they were
> maintaining a global index and therefore were something more than just a
> format. Are you taking parquet into the same territory? And if not, what
> assumptions did you make to keep things simple?
>

Although this is on the Parquet list, it is a higher-level project that
we're starting up. From our discussion on the sync-up, people on this list
are interested so I thought I'd throw it out to the community. I'll start a
list for this project and stop spamming the Parquet dev list soon.

I'm not sure how to answer your question about assumptions we've made.
Maybe it will be more clear when we get more docs out. Until then, I'm
happy to answer questions about how it works.


> I hope I’m not being too negative. If you’re solving distributed metastore
> problems that would be huge. But it sounds a bit too good to be true.
>

You're not at all too negative. I realize that this is new and we have yet
to prove out portions of it. What we have so far works.

My initial tests are on our job metrics table, which has 10 months of
task-level data from our production Hadoop clusters. Each day is about
400m rows in 30GB across 250 Parquet files. In all, the test table is about
65,000 files. The current Iceberg library takes about 800ms to select the
files for a query, using a single thread (the format supports breaking the
work across multiple threads). And because this is scanning a manifest of
all the files in a table, it can plan any query on that table in this
amount of time. Planning in 1s isn't incredible, but it is a major
improvement over the current Hive layout considering that some of our table
scans take 10 minutes to plan (using 8 threads) due to directory listing in
S3.

rb

-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg table format

Posted by Julian Hyde <jh...@apache.org>.
Since you’re dealing with change (atomic updates and so forth) are we still just talking about a *format* or does the architecture also now include pieces that could be described as *servers* and *protocols*?

When I first spoke to the CarbonData folks I observed that they were maintaining a global index and therefore were something more than just a format. Are you taking parquet into the same territory? And if not, what assumptions did you make to keep things simple?

I hope I’m not being too negative. If you’re solving distributed metastore problems that would be huge. But it sounds a bit too good to be true. 

Julian

On 2017-12-06 16:20, Ryan Blue <rb...@netflix.com.INVALID> wrote: 
> [original announcement snipped]