Posted to dev@iceberg.apache.org by Ryan Blue <rb...@netflix.com.INVALID> on 2020/03/28 00:53:43 UTC

Iceberg community sync - 2020-03-25

Hi everyone,

Here are my notes from the discussion. These are based mainly on my memory,
so feel free to correct or expand if you think it can be improved. Thanks!

*Agenda*

   - Cadence for syncs - every 2-4 weeks?
   - 0.8.0 Java release
   - Community building
   - Flink source and sink status
   - MR formats and Hive support status
   - Security (authorization, data values in metadata)
   - Row-level deletes (main discussion)

*Discussion*:

   - Sync cadence
      - Ryan: with syncs alternating time zones, 4 weeks is too long, but 2
      weeks is a lot for those of us attending all of them. How about 3 weeks?
      - Consensus was every 3 weeks
   - 0.8.0 Java release
      - When should we target the release? Consensus was mid-April (3 weeks
      out)
      - What do we want in the release? Main outstanding features are ORC
      support, Parquet vectorized reads, Spark/Hive changes
      - Ideally will include ORC support, since it is close
      - Hive version is 2.3 and should not block Hive work
      - Vectorized reads are nice-to-have but should not block a release
      - Can we disable consistent versions for Spark 2.4 and Spark 3.0
      support in the same repo? Ryan will dig up a build script with baseline
      applied to only some modules; maybe we can disable it there
   - Community building
      - Saisai suggested a Powered By page where we can post who is using
      Iceberg in production. Great idea!
      - Openinx suggested a blog section of the docs site
      - Ryan has concerns about blogs in docs - why not link to blogs on
      other platforms? We don’t want content to get stale or have the community
      “reviewing” content.
      - Owen: some blogs break links
   - Flink source and sink status
      - The Tencent data lake team posted a sink based on the Netflix
      skunkworks sink, but the Netflix-specific features and dependencies
      need to be removed
      - Issues were opened for the work to get the sink in
      - Ryan: we’ll need reviewers because I’m not qualified. Will reach out
      to Steven Wu (Netflix sink author) and other people interested in
      Flink.
      - Ryan: the Spark source is coming along, but the hardest part is
      getting a stream of files to process from table state. Is that
      something we want to share between Spark and Flink implementations?
      - Probably want to share, if possible
   - Skipped MR/Hive status and security (will start dev list thread) to
   get to row-level deletes
   - Row-level deletes roadmap:
      - Ryan will be working on this more, with a doc for Spark MERGE INTO
      interfaces coming soon
      - This has been moving slowly because some parts, like sequence
      numbers, require forward-breaking/v2 changes
      - Owen suggested building two parallel write paths to be able to
      write v1. Everyone agreed with this
      - There are several projects that can be done by anyone and do not
      require forward-breaking/v2 changes: delete file format readers,
      writers, record iterator implementations to merge deletes (set-based,
      merge-based), and specs for these once they are built
      - Junjie offered to work on file/position delete files
      - Equality delete merges are blocked on sort order addition to the
      format
      - The main blocking decision point is how to track delete files in
      manifests; Ryan will start a dev list thread
      - Openinx brought up concerns about minimizing end-to-end latency for
      a use case with high write volume for equality deletes
      - Ryan’s response was that this will likely require off-line
      optimization: write equality deletes from Flink but rewrite in a more
      efficient format (sorted, translated to file/position, etc.) in a
      separate service. Enabling these services is the role of Iceberg, which
      is an at-rest format. Other approaches put this complexity into the
      writer, but it has to be done somewhere.
      - Gautam: what about GDPR deletes?
      - Ryan: GDPR deletes are a simpler case, where volume is much lower.
      That brings us back to the roadmap: let’s focus on simpler end-to-end use
      cases and get those done. Then we can work on scaling them. First things
      are to get the formats defined and documented, get a set-based delete
      filter implementation for equality deletes and a merge-based one for
      file/position deletes, and to add sequence numbers (a rough sketch of
      the set-based filter is included after these notes).
   - Thanks to everyone who attended! Will schedule the next sync for 3
   weeks from now.
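
For anyone picking up the delete filter work above, here is a rough sketch of
what a set-based filter for equality deletes could look like. To be clear,
this is not Iceberg API and none of these names exist in the codebase; the
Row interface and the key projection are stand-ins for whatever the delete
file readers end up producing.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch only; "Row" stands in for the engine's record type.
interface Row {
  Object get(int fieldId);
}

class SetBasedEqualityDeleteFilter {
  private final List<Integer> equalityFieldIds;
  private final Set<List<Object>> deletedKeys;

  // Load the equality key of every delete row into an in-memory set. This
  // assumes the delete set fits in memory, which is why high write volumes
  // likely need the off-line rewrite discussed above.
  SetBasedEqualityDeleteFilter(Stream<Row> deleteRows, List<Integer> equalityFieldIds) {
    this.equalityFieldIds = equalityFieldIds;
    this.deletedKeys = deleteRows.map(this::key).collect(Collectors.toSet());
  }

  // Keep only the data rows whose equality key is not in the delete set.
  Stream<Row> filter(Stream<Row> dataRows) {
    return dataRows.filter(row -> !deletedKeys.contains(key(row)));
  }

  // Project the equality columns out of a row; List equality is element-wise,
  // so the projected values work as a composite key.
  private List<Object> key(Row row) {
    return equalityFieldIds.stream().map(row::get).collect(Collectors.toList());
  }
}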

-- 
Ryan Blue
Software Engineer
Netflix

Re: Iceberg community sync - 2020-03-25

Posted by OpenInx <op...@gmail.com>.
> Ryan has concerns about blogs in docs - why not link to blogs on other
> platforms? We don’t want content to get stale or have the community
> “reviewing” content.
I mean we could create a page to collect all the design doc links first.
Stale content is indeed a problem unless we update the docs for each
relevant change. I don't have a strong opinion about the reviewing concern
:-)

> Ryan: we’ll need reviewers because I’m not qualified. Will reach out to
> Steven Wu (Netflix sink author) and other people interested in Flink.
Steven did a great job; he's the perfect reviewer if he has the bandwidth.
There are some Flink committers and PMC members on our Flink team, so we
could also ping them.

> Openinx brought up concerns about minimizing end-to-end latency
Agreed that we could implement the file/pos deletes and equality deletes
first. The off-line optimization seems reasonable. We also had an internal
discussion about the e2e latency and have some ideas to minimize it; maybe I
could write a short doc to describe them. In any case, we can push the
file/pos and equality deletes forward first.
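
For the file/pos delete path, I imagine something like the merge-style
filter below. This is only a rough sketch to show the idea, not real Iceberg
code: it assumes the deleted positions for one data file come in as a sorted
stream of longs, so they can be merged against the data rows in a single
pass.

import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

// Rough sketch only, not Iceberg code. Data rows are read in position order
// and the deleted positions are sorted ascending, so one pass is enough.
class MergeBasedPositionDeleteFilter<T> implements Iterator<T> {
  private final Iterator<T> dataRows;             // rows of one data file
  private final PrimitiveIterator.OfLong deletes; // deleted positions, sorted
  private long position = -1;                     // position of the last row read
  private long nextDelete = -1;
  private boolean hasDelete;
  private T nextRow;
  private boolean hasNextRow;

  MergeBasedPositionDeleteFilter(Iterator<T> dataRows, PrimitiveIterator.OfLong deletes) {
    this.dataRows = dataRows;
    this.deletes = deletes;
    advanceDelete();
    advanceRow();
  }

  private void advanceDelete() {
    hasDelete = deletes.hasNext();
    nextDelete = hasDelete ? deletes.nextLong() : -1;
  }

  private void advanceRow() {
    hasNextRow = false;
    while (dataRows.hasNext()) {
      T row = dataRows.next();
      position += 1;
      // skip any delete positions that fall behind the current row
      while (hasDelete && nextDelete < position) {
        advanceDelete();
      }
      if (hasDelete && nextDelete == position) {
        advanceDelete();   // this row is deleted: drop it and keep scanning
        continue;
      }
      nextRow = row;
      hasNextRow = true;
      return;
    }
  }

  @Override
  public boolean hasNext() {
    return hasNextRow;
  }

  @Override
  public T next() {
    if (!hasNextRow) {
      throw new NoSuchElementException();
    }
    T row = nextRow;
    advanceRow();
    return row;
  }
}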
