Posted to dev@iceberg.apache.org by Carl Steinbach <cw...@apache.org> on 2021/08/04 03:06:03 UTC

[NOTES] Iceberg Community Meeting - July 21 2021

Iceberg Community Meetings are open to everyone. To receive an invitation
to the next meeting, please join the iceberg-sync@googlegroups.com
<https://groups.google.com/g/iceberg-sync> list.
Notes from previous meetings along with a running agenda for the next
meeting are available here:
https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?pli=1#heading=h.z3dncl7gr8m1

21 July 2021

- Releases
  - 0.12 Release status
    - Currently blocked on “Handle the case that RewriteFiles and RowDelta
      commit the transaction at the same time” #2308
      <https://github.com/apache/iceberg/issues/2308>. Ryan is working on a fix.
  - Consider dropping support for Spark 3.0 and 3.1 after 0.12 once Spark 3.2
    is available
    - Spark 3.2 is set to include many changes to DSv2 which we can leverage
      to make our code simpler. Examples include eliminating the need to
      provide our own distribution and sort ordering utils for Spark, and the
      ability to deal with Spark expressions directly instead of via Iceberg
      wrapper code.
    - Should we cut support for 3.0 and 3.1 and support only Spark 3.2 in the
      next release, in order to avoid a three-way version split? That split
      currently looks like it would require an additional Spark module that is
      3.2 specific.
    - [Anton] This is not just about the tech debt added by shims. It’s also
      about not being able to use certain Spark APIs that were introduced in
      newer versions. For example, 3.1 adds the purge flag as well as
      structured streaming APIs related to limit support, and 3.2 adds the
      distribution and ordering support. I’m in favor of keeping it simple:
      release 0.12 with support for all Spark versions, and then migrate to
      Spark 3.2 in the next version of Iceberg.
    - [Ryan] To recap, the main issue is that we would need to bump the Spark
      version to 3.2 in order to pull in the new interfaces, and then when
      that same module is used on Spark 3.1, the interfaces are missing, so we
      can’t actually load the classes. I think we may be able to solve this by
      not loading the interface until it is actually needed. In other words,
      keep the 3.2-specific behavior on a separate object and mix in the new
      interface only at that point. You can sometimes get away with having an
      extra class in there, as long as the part of it that depends on the
      missing interface is never loaded (a rough sketch of this idea follows
      the conclusion below). I’ll do some testing and see if I can get this
      working between Spark 3.2 and 3.1.
    - Conclusion: keep this discussion open for a bit longer while Ryan does
      some exploration to see if his approach is viable.
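    - For illustration only, a rough, self-contained sketch of the
      class-loading idea above. Every name in it (NewDsv2Interface,
      SparkTableBase, SparkTable32, SparkTables, the probe class name) is a
      hypothetical stand-in, not an actual Iceberg or Spark class; the point
      is just that the JVM only links an interface when a class implementing
      it is loaded, so keeping the 3.2-only code in a separately loaded
      subclass lets the shared class still work on Spark 3.1.

          // Sketch only: NewDsv2Interface stands in for a DSv2 interface that
          // exists only on a Spark 3.2 classpath; none of these names are real
          // Iceberg or Spark classes.
          interface NewDsv2Interface {
            String newCapability();
          }

          class SparkTableBase {
            String name() {
              return "table";
            }
          }

          // The only class that mixes in the new interface. Nothing loads it
          // eagerly, so a 3.1 classpath that lacks the interface never has to
          // link it.
          class SparkTable32 extends SparkTableBase implements NewDsv2Interface {
            @Override
            public String newCapability() {
              return "distribution and ordering";
            }
          }

          class SparkTables {
            static boolean newApiAvailable() {
              try {
                // Probe for a class that only ships with Spark 3.2; the class
                // name here is a placeholder.
                Class.forName("org.example.spark32.OnlyIn32");
                return true;
              } catch (ClassNotFoundException e) {
                return false;
              }
            }

            static SparkTableBase load() {
              if (newApiAvailable()) {
                try {
                  // Load and instantiate the 3.2 variant by name so the class
                  // (and its interface) is only touched when it can link.
                  return (SparkTableBase) Class.forName("SparkTable32")
                      .getDeclaredConstructor().newInstance();
                } catch (ReflectiveOperationException e) {
                  throw new RuntimeException(e);
                }
              }
              return new SparkTableBase();
            }

            public static void main(String[] args) {
              // Prints "table" here, since the placeholder probe class is absent.
              System.out.println(SparkTables.load().name());
            }
          }
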
- Slack community
  - [Ryan] At the last meeting we discussed ways of making it easier for
    community members to join the Iceberg channel on the ASF’s Slack
    workspace. The discussion was tabled when it became known that there’s a
    self-invite link. Unfortunately, it turns out the link regularly breaks,
    and the ASF INFRA team has declined to fix it this time because of an
    influx of spammers. Carl created a separate Slack workspace dedicated to
    Apache Iceberg. I think we should migrate to this workspace, since making
    it easy for everyone to join and enter the discussion is more important
    than leveraging the existing ASF infrastructure. Since I’m seeing lots of
    +1s for this in the chat, I think the next step is to raise this issue on
    the dev list. (related thread
    <https://lists.apache.org/thread.html/r4a23572882f421944ed545f5d7dd798b3580c120e7e246a3f604cfcf%40%3Cdev.iceberg.apache.org%3E>,
    Slack invite link
    <https://join.slack.com/t/apache-iceberg/shared_invite/zt-tlv0zjz6-jGJEkHfb1~heMCJA3Uycrg>)
  - Addendum: On the dev list thread we decided to move to the apache-iceberg
    Slack workspace.

- Bucketing with Unicode characters (#2837
  <https://github.com/apache/iceberg/issues/2837>)
  - Mateusz Gajewski at Starburst discovered that Iceberg’s bucket hash
    function for Strings generates values that don’t adhere to the Iceberg
    spec when the input String contains Unicode surrogate pair characters
    <https://docs.microsoft.com/en-us/globalization/encoding/surrogate-pairs#:~:text=With%20surrogate%20pairs%2C%20a%20Unicode,over%20one%20million%20additional%20characters.>.
    The root cause of this issue is a bug in Guava’s
    Hashing.murmur3_32().hashString method
    <https://github.com/google/guava/issues/5648>.
  - It’s easy to work around this issue in Iceberg by using
    murmur3_32().hashBytes in place of murmur3_32().hashString (a rough sketch
    of the mismatch follows this topic), but what do we need to do to help
    users who potentially have existing data stored this way?
  - Two approaches were discussed: (1) provide a compatibility mode that would
    produce both bucket hash values or fall back to the old behavior, and (2)
    provide a Spark action that users can use to fix their data. People felt
    (1) was risky on account of lots of potential corner cases, and Ryan noted
    that (2) is something we need to invest in anyway to help people migrate
    from one partitioning scheme to another.
  - Conclusion: (1) document how to correct the data using MERGE INTO, (2) fix
    the bucket function, and (3) add a Spark action for correcting the data.
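  - For illustration only, a minimal, self-contained reproduction of the
    mismatch described above (it assumes Guava is on the classpath; the class
    name and input string are made up): hashString disagrees with hashing the
    UTF-8 bytes when the input contains a surrogate pair, and the bytes-based
    value is the one the Iceberg spec calls for.

        import com.google.common.hash.Hashing;
        import java.nio.charset.StandardCharsets;

        public class BucketHashCheck {
          public static void main(String[] args) {
            // U+1F603 is encoded as a surrogate pair (two Java chars).
            String s = "emoji \uD83D\uDE03";

            // Path affected by the Guava bug for surrogate-pair inputs.
            int viaHashString =
                Hashing.murmur3_32().hashString(s, StandardCharsets.UTF_8).asInt();

            // Workaround: hash the UTF-8 bytes directly.
            int viaHashBytes =
                Hashing.murmur3_32().hashBytes(s.getBytes(StandardCharsets.UTF_8)).asInt();

            // Prints false on Guava versions affected by the bug.
            System.out.println(viaHashString == viaHashBytes);
          }
        }
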
- Z-Ordering
  - Bhavyam Kamal presented his proposal
    <https://docs.google.com/document/d/1UfGxaB7qlrGzzMk9pBm03oKPOkm-jk-NQVQQvHP-0Bc/edit>
    for adding Z-Ordering to Iceberg and demoed his prototype implementation.
    Z-Ordering is a technique for clustering data in multiple dimensions so
    that data files cover non-overlapping ranges of the clustered columns,
    which results in more efficient file pruning when applying predicates. (A
    small illustration of the idea follows this topic.)
  - Conclusion: The plan going forward is to split the work into two phases:
    - 1) Implement merge-sort-based compaction and allow compaction/rewrite of
      data files using a space-filling-curve-based sort. No planning or
      persisting of metrics.
    - 2) Support Transforms with multiple arguments and possible
      parameterization, store metrics for curve values in data file metrics
      along with the transform used when writing the file, and modify query
      planning to use these metrics.
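  - For illustration only, a tiny sketch of the bit-interleaving idea behind a
    Z-order (Morton) curve; this is not the proposed implementation, just the
    core step for two 32-bit values (class and method names are made up).

        public class ZOrderSketch {
          // Interleave the bits of x and y into one 64-bit Z-value: bit i of x
          // goes to bit 2*i, bit i of y goes to bit 2*i + 1.
          static long interleave(int x, int y) {
            long z = 0L;
            for (int i = 0; i < 32; i++) {
              z |= ((long) ((x >>> i) & 1)) << (2 * i);
              z |= ((long) ((y >>> i) & 1)) << (2 * i + 1);
            }
            return z;
          }

          public static void main(String[] args) {
            // Sorting rows by interleave(colA, colB) clusters them in both
            // dimensions at once, so each data file covers a narrow range of
            // both columns and can be pruned by predicates on either one.
            System.out.println(Long.toBinaryString(interleave(0b1010, 0b0110)));
          }
        }
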
- We ran out of time before getting to the following topics:
  - APIs deprecated in 0.11 and scheduled for removal in 0.12
  - Relative paths in the metadata (design doc
    <https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0>)
    - JSON metadata location
    - Source of truth for table roots
    - Is there an alternative that supports use cases better?
  - Sort Ordering
  - Secondary Indexes
  - Commit message format and PR description template
  - Manifest V2 Discuss Thread