Posted to dev@parquet.apache.org by Gabor Szadovszky <ga...@apache.org> on 2019/11/13 13:33:12 UTC

[VOTE] Release Apache Parquet 1.11.0 RC7

Hi everyone,

I propose the following RC to be released as official Apache Parquet 1.11.0
release.

The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
* This corresponds to the tag: apache-parquet-1.11.0-rc7
*
https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7

You can find the KEYS file here:
* https://apache.org/dist/parquet/KEYS

Binary artifacts are staged in Nexus here:
* https://repository.apache.org/content/groups/staging/org/apache/parquet/

This release includes the changes listed at:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md

Please download, verify, and test.

Please vote in the next 72 hours.

[ ] +1 Release this as Apache Parquet 1.11.0
[ ] +0
[ ] -1 Do not release this because...

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Xinli shang <sh...@uber.com.INVALID>.
Hi Ryan/Gabor,

I will do some tests on real data with checksum enabled.

Xinli

On Wed, Nov 20, 2019 at 1:29 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Thanks, Fokko.
>
> Ryan, we did not do such measurements yet. I'm afraid I won't have enough
> time to do that in the next couple of weeks.
>
> Cheers,
> Gabor
>
> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
>
> > Thanks Gabor for the explanation. I'd like to change my vote to +1
> > (non-binding).
> >
> > Cheers, Fokko
> >
> > On Tue, Nov 19, 2019 at 18:03, Ryan Blue <rblue@netflix.com.invalid> wrote:
> >
> > > Gabor, what I meant was: have we tried this with real data to see the
> > > effect? I think those results would be helpful.
> > >
> > > On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <ga...@apache.org>
> > > wrote:
> > >
> > > > Hi Ryan,
> > > >
> > > > It is not easy to calculate. For the column indexes feature we
> > > > introduced two new structures saved before the footer: column indexes
> > > > and offset indexes. If the min/max values are not too long, truncating
> > > > them might not decrease the file size, because the offset indexes still
> > > > add overhead. Moreover, we also introduced parquet.page.row.count.limit,
> > > > which might increase the number of pages and therefore the file size.
> > > > The footer itself has also changed and we are saving more values in it:
> > > > the offsets to the column/offset indexes, the new logical type
> > > > structures, and the CRC checksums (there might be others).
> > > > So, files with a small amount of data will grow (because of the larger
> > > > footer). Files whose values encode very well (RLE) will probably grow
> > > > too (because we will have more pages). Some files with long values
> > > > (>64 bytes by default) might shrink, because the min/max values are
> > > > truncated.
> > > >
> > > > Regards,
> > > > Gabor
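
For anyone who wants to reproduce the size effects Gabor describes, the
write-path knobs can be set on the Hadoop Configuration that parquet-mr
reads. A minimal sketch in Java: parquet.page.row.count.limit is named in
the thread itself, while the truncation key
"parquet.columnindex.truncate.length" and the example values are
assumptions, not anything confirmed here.

    import org.apache.hadoop.conf.Configuration;

    public class SizeKnobs {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Named in the thread: a lower row-count limit produces more pages,
        // which grows files whose values encode very well (e.g. RLE).
        conf.setInt("parquet.page.row.count.limit", 5000);
        // Assumed key: truncation length for column-index min/max values
        // (the ">64 bytes by default" mentioned above); long values get
        // truncated, which can shrink files with long min/max values.
        conf.setInt("parquet.columnindex.truncate.length", 64);
        System.out.println(conf.get("parquet.page.row.count.limit"));
      }
    }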
> > > >
> > > > On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rblue@netflix.com.invalid
> >
> > > > wrote:
> > > >
> > > > > Gabor, do we have an idea of the additional overhead for a non-test
> > > data
> > > > > file? It should be easy to validate that this doesn't introduce an
> > > > > unreasonable amount of overhead. In some cases, it should actually
> be
> > > > > smaller since the column indexes are truncated and page stats are
> > not.
> > > > >
> > > > > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > > > > <ga...@cloudera.com.invalid> wrote:
> > > > >
> > > > > > Hi Fokko,
> > > > > >
> > > > > > For the first point: the referenced constructor is private and
> > > > > > Iceberg uses it via reflection. It is not a breaking change. I
> > > > > > think parquet-mr shall not keep private methods only because
> > > > > > clients might use them via reflection.
> > > > > >
> > > > > > About the checksum: I've agreed on having the CRC checksum write
> > > > > > enabled by default because the benchmarks did not show significant
> > > > > > performance penalties. See
> > > > > > https://github.com/apache/parquet-mr/pull/647 for details.
> > > > > >
> > > > > > About the file size change: 1.11.0 introduces column indexes and
> > > > > > the CRC checksum, removes the statistics from the page headers,
> > > > > > and maybe makes other changes that impact file size. If only the
> > > > > > file size is in question, I cannot see a breaking change here.
> > > > > >
> > > > > > Regards,
> > > > > > Gabor
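
For reference, the write-time CRC can be toggled per job through the Hadoop
Configuration. A minimal sketch; the property name
"parquet.page.write-checksum.enabled" is an assumption about the 1.11
configuration key rather than something quoted in this thread.

    import org.apache.hadoop.conf.Configuration;

    public class ChecksumToggle {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumed key for the page-level CRC write flag that the linked
        // ParquetProperties.java line defaults to true in 1.11.
        conf.setBoolean("parquet.page.write-checksum.enabled", false);
        System.out.println("page write checksums: "
            + conf.getBoolean("parquet.page.write-checksum.enabled", true));
      }
    }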
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> > > <fokko@driesprong.frl
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Unfortunately, a -1 from my side (non-binding)
> > > > > > >
> > > > > > > I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > > > > >
> > > > > > >    - We've broken backward compatibility of the constructor of
> > > > > > >    ColumnChunkPageWriteStore
> > > > > > >    <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> > > > > > >    This required a change
> > > > > > >    <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> > > > > > >    to the code. This isn't a hard blocker, but if there will be
> > > > > > >    a new RC, I've submitted a patch:
> > > > > > >    https://github.com/apache/parquet-mr/pull/699
> > > > > > >    - Related, and something we need to put in the changelog:
> > > > > > >    checksums are enabled by default
> > > > > > >    <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54>.
> > > > > > >    This will impact performance. I would suggest disabling it by
> > > > > > >    default: https://github.com/apache/parquet-mr/pull/700
> > > > > > >    <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> > > > > > >    - Binary compatibility. While updating Iceberg, I've noticed
> > > > > > >    that the split-test was failing:
> > > > > > >    https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > > >    The two records are now divided over four Spark partitions.
> > > > > > >    Something in the output has changed, since the files are
> > > > > > >    bigger now. Does anyone have an idea what's changed, or a way
> > > > > > >    to check this? The only thing I can think of is the checksum
> > > > > > >    mentioned above.
> > > > > > >
> > > > > > > $ ls -lah ~/Desktop/parquet-1-1*
> > > > > > > -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > > > > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > > > > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > >
> > > > > > > $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > id = 1
> > > > > > > data = a
> > > > > > >
> > > > > > > $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > id = 1
> > > > > > > data = a
> > > > > > >
> > > > > > > A binary diff here:
> > > > > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > > >
> > > > > > > Cheers, Fokko
> > > > > > >
> > > > > > > On Sat, Nov 16, 2019 at 04:18, Junjie Chen <chenjunjiedada@gmail.com> wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > > Verified signature, checksum, and ran mvn install successfully.
> > > > > > > >
> > > > > > > > Wang, Yuming <yu...@ebay.com.invalid> wrote on Thursday, Nov 14, 2019 at 2:05 PM:
> > > > > > > > >
> > > > > > > > > +1
> > > > > > > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> > > > > > "sql/test-only"
> > > > > > > > -Phadoop-3.2
> > > > > > > > >
> > > > > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <
> gabor@apache.org>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > >     Hi everyone,
> > > > > > > > >
> > > > > > > > >     I propose the following RC to be released as official
> > > Apache
> > > > > > > Parquet
> > > > > > > > 1.11.0
> > > > > > > > >     release.
> > > > > > > > >
> > > > > > > > >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > > > >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > > > > > >     *
> > > > > > > > > https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > > > >
> > > > > > > > >     The release tarball, signature, and checksums are here:
> > > > > > > > >     *
> > > > > > > > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > > > > > > > >
> > > > > > > > >     You can find the KEYS file here:
> > > > > > > > >     *
> > > > > > > > > https://apache.org/dist/parquet/KEYS
> > > > > > > > >
> > > > > > > > >     Binary artifacts are staged in Nexus here:
> > > > > > > > >     *
> > > > > > > > > https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > > > > >
> > > > > > > > >     This release includes the changes listed at:
> > > > > > > > >
> > > > > > > > > https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > > > > > > > >
> > > > > > > > >     Please download, verify, and test.
> > > > > > > > >
> > > > > > > > >     Please vote in the next 72 hours.
> > > > > > > > >
> > > > > > > > >     [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > > > > >     [ ] +0
> > > > > > > > >     [ ] -1 Do not release this because...
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>


-- 
Xinli Shang

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Michael Heuer <he...@gmail.com>.
Clirr's binary compatibility check against 1.10.1 fails

parquet-mr (HEAD detached at apache-parquet-1.11.0-rc7)
$ mvn clirr:check -DcomparisonArtifacts=1.10.1
…
[INFO] --- clirr-maven-plugin:2.6.1:check (default-cli) @ parquet-common ---
[INFO] artifact org.apache.parquet:parquet-common: checking for updates from jitpack.io
[INFO] artifact org.apache.parquet:parquet-common: checking for updates from central
[INFO] Comparing to version: 1.10.1
[ERROR] 7009: org.apache.parquet.bytes.ByteBufferInputStream: Accessibility of method 'public ByteBufferInputStream()' has been decreased from public to package
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache Parquet MR 1.11.0:
[INFO]
[INFO] Apache Parquet MR .................................. SUCCESS [  2.052 s]
[INFO] Apache Parquet Format Structures ................... SUCCESS [  7.035 s]
[INFO] Apache Parquet Generator ........................... SUCCESS [  1.872 s]
[INFO] Apache Parquet Common .............................. FAILURE [  1.478 s]
...
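
To make the reported error concrete: code compiled against 1.10.1 that used
the no-arg constructor still compiles, but fails at link time when run
against 1.11.0, because the constructor's visibility dropped from public to
package-private. A hypothetical caller, for illustration only:

    import org.apache.parquet.bytes.ByteBufferInputStream;

    public class CompiledAgainst1101 {
      public static void main(String[] args) {
        // Compiles and runs against 1.10.1; run against 1.11.0-rc7 this
        // line throws java.lang.IllegalAccessError, since the no-arg
        // constructor is no longer public (clirr error 7009 above).
        ByteBufferInputStream in = new ByteBufferInputStream();
        System.out.println(in);
      }
    }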


> On Nov 22, 2019, at 2:23 AM, Gabor Szadovszky <ga...@apache.org> wrote:
> 
> Ryan,
> I would not trust our compatibility checks (semver) too much. Currently,
> they are configured to compare our current version to 1.7.0, which means
> anything that was added after 1.7.0 and then broken in a later release
> won't be caught. In addition, many packages are excluded from the check
> for different reasons. For example, org/apache/parquet/schema/** is
> excluded, so if this really were an API compatibility issue we certainly
> wouldn't catch it.
> 
> Michael,
> It fails because of a NoSuchMethodError pointing to a method that is newly
> introduced in 1.11. Both the caller and the callee are shipped by
> parquet-mr, so I'm quite sure it is a classpath issue. It seems that the
> 1.11 version of the parquet-column jar is not on the classpath.
> 
> 
> On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <he...@gmail.com> wrote:
> 
>> The dependency versions are consistent in our artifact
>> 
>> $ mvn dependency:tree | grep parquet
>> [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
>> [INFO] |     \-
>> org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
>> [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
>> [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
>> [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
>> [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
>> [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
>> 
>> The latter error
>> 
>> Caused by: org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task
>> 0.0 in stage 0.0 (TID 0, localhost, executor driver):
>> java.lang.NoSuchMethodError:
>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>        at
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>> 
>> occurs when I attempt to run via spark-submit on Spark 2.4.4
>> 
>> $ spark-submit --version
>> Welcome to
>>      ____              __
>>     / __/__  ___ _____/ /__
>>    _\ \/ _ \/ _ `/ __/  '_/
>>   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>>      /_/
>> 
>> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
>> Branch
>> Compiled by user  on 2019-08-27T21:21:38Z
>> Revision
>> Url
>> Type --help for more information.
>> 
>> 
>> 
>>> On Nov 21, 2019, at 6:06 PM, Ryan Blue <rb...@netflix.com.INVALID>
>> wrote:
>>> 
>>> Thanks for looking into it, Nandor. That doesn't sound like a problem
>> with
>>> Parquet, but a problem with the test environment since parquet-avro
>> depends
>>> on a newer API method.
>>> 
>>> On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
>> <nk...@cloudera.com.invalid>
>>> wrote:
>>> 
>>>> I'm not sure that this is a binary compatibility issue. The missing
>>>> builder method was recently added in 1.11.0 with the introduction of
>>>> the new logical type API, while the original version of this method
>>>> (the one with a single OriginalType input parameter, called before by
>>>> AvroSchemaConverter) is kept untouched. It seems to me that the Parquet
>>>> versions on the Spark executor mismatch: parquet-avro is on 1.11.0, but
>>>> parquet-column is still on an older version.
>>>> 
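
For context, the two overloads Nandor describes can be seen side by side in
the public Types builder API. A minimal sketch; the BINARY/string schema is
illustrative only:

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Type;
    import org.apache.parquet.schema.Types;

    public class BuilderOverloads {
      public static void main(String[] args) {
        // Pre-1.11 overload, kept untouched: as(OriginalType).
        Type legacy = Types.required(PrimitiveTypeName.BINARY)
            .as(OriginalType.UTF8)
            .named("data");
        // New 1.11 overload used by AvroSchemaConverter; if an older
        // parquet-column is on the classpath, this is the call that fails
        // with NoSuchMethodError.
        Type modern = Types.required(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType())
            .named("data");
        System.out.println(legacy + "\n" + modern);
      }
    }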
>>>> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <he...@gmail.com>
>> wrote:
>>>> 
>>>>> Perhaps not strictly necessary to say, but if this particular
>>>>> compatibility break between 1.10 and 1.11 was intentional, and no other
>>>>> compatibility breaks are found, I would vote -1 (non-binding) on this
>>>>> RC so that we might go back and revisit the changes to preserve
>>>>> compatibility.
>>>>> 
>>>>> I am not sure there is presently enough motivation in the Spark project
>>>>> for a release after 2.4.4 and before 3.0 in which to bump the Parquet
>>>>> dependency version to 1.11.x.
>>>>> 
>>>>>  michael
>>>>> 
>>>>> 
>>>>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rb...@netflix.com.INVALID>
>>>>> wrote:
>>>>>> 
>>>>>> Gabor, shouldn't Parquet be binary compatible for public APIs? From
>> the
>>>>>> stack trace, it looks like this 1.11.0 RC breaks binary compatibility
>>>> in
>>>>>> the type builders.
>>>>>> 
>>>>>> Looks like this should have been caught by the binary compatibility
>>>>> checks.
>>>>>> 
>>>>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <ga...@apache.org>
>>>>> wrote:
>>>>>> 
>>>>>>> Hi Michael,
>>>>>>> 
>>>>>>> Unfortunately, I don't have too much experience with Spark. But if
>>>>>>> Spark uses the parquet-mr library in an embedded way (that's how Hive
>>>>>>> uses it), it is required to rebuild Spark with the 1.11 RC parquet-mr.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Gabor
>>>>>>> 
>>>>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <he...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> It appears a provided-scope dependency on spark-sql leaking old
>>>>>>>> Parquet versions was causing the runtime error below. After
>>>>>>>> including new parquet-column and parquet-hadoop compile-scope
>>>>>>>> dependencies (in addition to parquet-avro, which we already have at
>>>>>>>> compile scope), our build succeeds.
>>>>>>>> 
>>>>>>>> https://github.com/bigdatagenomics/adam/pull/2232
>>>>>>>> 
>>>>>>>> However, when running via spark-submit I run into a similar runtime
>>>>> error
>>>>>>>> 
>>>>>>>> Caused by: java.lang.NoSuchMethodError:
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>>>>>>>>      at
>>>>>>>> 
>>>>> 
>> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>>>>>      at org.apache.spark.internal.io
>>>>>>>> 
>>>> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>>>>>      at org.apache.spark.internal.io
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>>>>>      at org.apache.spark.internal.io
>>>>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>>>>>      at org.apache.spark.internal.io
>>>>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>>>>>      at
>>>>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>>>>>      at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>>>>>      at
>>>>>>>> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>>>>>      at
>>>>>>>> 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>      at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>      at java.lang.Thread.run(Thread.java:748)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Will bumping our library dependency version to 1.11 require a new
>>>>> version
>>>>>>>> of Spark, built against Parquet 1.11?
>>>>>>>> 
>>>>>>>> Please accept my apologies if this is heading out-of-scope for the
>>>>>>> Parquet
>>>>>>>> mailing list.
>>>>>>>> 
>>>>>>>> michael
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <he...@GMAIL.COM>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> I am willing to do some benchmarking on genomic data at scale, but
>>>>>>>>> I am not quite sure what the Spark target version for 1.11.0 might
>>>>>>>>> be. Will Parquet 1.11.0 be compatible with Spark 2.4.x?
>>>>>>>>> 
>>>>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
>>>>>>>>> 
>>>>>>>>> …
>>>>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
>>>>>>>> org/apache/parquet/schema/LogicalTypeAnnotation
>>>>>>>>>    at
>>>>>>>> 
>>>>> 
>> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>>>>>>    at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>>>>>>    at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>>>>>>    at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>>>>>>    at org.apache.spark.internal.io
>>>>>>>> 
>>>> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>>>>>>    at org.apache.spark.internal.io
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>>>>>>    at org.apache.spark.internal.io
>>>>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>>>>>>    at org.apache.spark.internal.io
>>>>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>>>>>>    at
>>>>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>>>>>>    at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>>>>>>    at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>>>>>>    at
>>>>>>>> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>>>>>>    at
>>>>>>>> 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>>>>>>    at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>>    at
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>>    at java.lang.Thread.run(Thread.java:748)
>>>>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>>>> org.apache.parquet.schema.LogicalTypeAnnotation
>>>>>>>>>    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>>>>>>>    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>>>>    at
>>>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>>>>>>>>>    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>>>> 
>>>>>>>>> michael
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <ga...@apache.org>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thanks, Fokko.
>>>>>>>>>> 
>>>>>>>>>> Ryan, we did not do such measurements yet. I'm afraid, I won't
>> have
>>>>>>>> enough
>>>>>>>>>> time to do that in the next couple of weeks.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Gabor
>>>>>>>>>> 
>>>>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko
>>>>>>> <fokko@driesprong.frl
>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks Gabor for the explanation. I'd like to change my vote to
>> +1
>>>>>>>>>>> (non-binding).
>>>>>>>>>>> 
>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Gabor, what I meant was: have we tried this with real data to
>> see
>>>>>>> the
>>>>>>>>>>>> effect? I think those results would be helpful.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <
>>>>> gabor@apache.org
>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It is not easy to calculate. For the column indexes feature we
>>>>>>>>>>> introduced
>>>>>>>>>>>>> two new structures saved before the footer: column indexes and
>>>>>>> offset
>>>>>>>>>>>>> indexes. If the min/max values are not too long, then the
>>>>>>> truncation
>>>>>>>>>>>> might
>>>>>>>>>>>>> not decrease the file size because of the offset indexes.
>>>>> Moreover,
>>>>>>>> we
>>>>>>>>>>>> also
>>>>>>>>>>>>> introduced parquet.page.row.count.limit which might increase
>> the
>>>>>>>> number
>>>>>>>>>>>> of
>>>>>>>>>>>>> pages which leads to increasing the file size.
>>>>>>>>>>>>> The footer itself is also changed and we are saving more values
>>>> in
>>>>>>>> it:
>>>>>>>>>>>> the
>>>>>>>>>>>>> offset values to the column/offset indexes, the new logical
>> type
>>>>>>>>>>>>> structures, the CRC checksums (we might have some others).
>>>>>>>>>>>>> So, the size of the files with small amount of data will be
>>>>>>> increased
>>>>>>>>>>>>> (because of the larger footer). The size of the files where the
>>>>>>>> values
>>>>>>>>>>>> can
>>>>>>>>>>>>> be encoded very well (RLE) will probably be increased (because
>>>> we
>>>>>>>> will
>>>>>>>>>>>> have
>>>>>>>>>>>>> more pages). The size of some files where the values are long
>>>>>>>> (>64bytes
>>>>>>>>>>>> by
>>>>>>>>>>>>> default) might be decreased because of truncating the min/max
>>>>>>> values.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Gabor
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue
>>>>>>> <rblue@netflix.com.invalid
>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Gabor, do we have an idea of the additional overhead for a
>>>>>>> non-test
>>>>>>>>>>>> data
>>>>>>>>>>>>>> file? It should be easy to validate that this doesn't
>> introduce
>>>>> an
>>>>>>>>>>>>>> unreasonable amount of overhead. In some cases, it should
>>>>> actually
>>>>>>>> be
>>>>>>>>>>>>>> smaller since the column indexes are truncated and page stats
>>>> are
>>>>>>>>>>> not.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>>>>>>>>>>>>>> <ga...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Fokko,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> For the first point. The referenced constructor is private
>> and
>>>>>>>>>>>> Iceberg
>>>>>>>>>>>>>> uses
>>>>>>>>>>>>>>> it via reflection. It is not a breaking change. I think,
>>>>>>> parquet-mr
>>>>>>>>>>>>> shall
>>>>>>>>>>>>>>> not keep private methods only because of clients might use
>>>> them
>>>>>>> via
>>>>>>>>>>>>>>> reflection.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> About the checksum. I've agreed on having the CRC checksum
>>>> write
>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>> by
>>>>>>>>>>>>>>> default because the benchmarks did not show significant
>>>>>>> performance
>>>>>>>>>>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647
>>>>> for
>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> About the file size change. 1.11.0 is introducing column
>>>>> indexes,
>>>>>>>>>>> CRC
>>>>>>>>>>>>>>> checksum, removing the statistics from the page headers and
>>>>> maybe
>>>>>>>>>>>> other
>>>>>>>>>>>>>>> changes that impact file size. If only file size is in
>>>> question
>>>>> I
>>>>>>>>>>>>> cannot
>>>>>>>>>>>>>>> see a breaking change here.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Gabor
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
>>>>>>>>>>>> <fokko@driesprong.frl
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three
>>>> things:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - We've broken backward compatibility of the constructor of
>>>>>>>>>>>>>>>> ColumnChunkPageWriteStore
>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>> This required a change
>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will be
>>>> a
>>>>>>>>>>>> new
>>>>>>>>>>>>>> RC,
>>>>>>>>>>>>>>>> I've
>>>>>>>>>>>>>>>> submitted a patch:
>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
>>>>>>>>>>>>>>>> - Related, that we need to put in the changelog, is that
>>>>>>>>>>>> checksums
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>>> enabled by default:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>> will impact performance. I would suggest disabling it by
>>>>>>>>>>>> default:
>>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
>>>>>>>>>>>>>>>> <
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed
>>>>>>>>>>>> that
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> split-test was failing:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>> two records are now divided over four Spark partitions.
>>>>>>>>>>>> Something
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> output has changed since the files are bigger now. Has
>> anyone
>>>>>>>>>>>> any
>>>>>>>>>>>>>> idea
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> check what's changed, or a way to check this? The only thing
>>>> I
>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> think of
>>>>>>>>>>>>>>>> is the checksum mentioned above.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
>>>>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
>>>>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> $ parquet-tools cat
>>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>>>>>>> id = 1
>>>>>>>>>>>>>>>> data = a
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> $ parquet-tools cat
>>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>>>>>>> id = 1
>>>>>>>>>>>>>>>> data = a
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> A binary diff here:
>>>>>>>>>>>>>>>> 
>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <chenjunjiedada@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>> Verified signature, checksum, and ran mvn install successfully.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> wrote on Thursday, Nov 14, 2019 at 2:05 PM:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt
>>>>>>>>>>>>>>> "sql/test-only"
>>>>>>>>>>>>>>>>> -Phadoop-3.2
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <
>>>> gabor@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I propose the following RC to be released as official
>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>> Parquet
>>>>>>>>>>>>>>>>> 1.11.0
>>>>>>>>>>>>>>>>>> release.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The release tarball, signature, and checksums are here:
>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> You can find the KEYS file here:
>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://apache.org/dist/parquet/KEYS
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
>>>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://repository.apache.org/content/groups/staging/org/apache/parquet/
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This release includes the changes listed at:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Please download, verify, and test.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Please vote in the next 72 hours.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>> Netflix
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
>> 


Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Nandor Kollar <nk...@cloudera.com.INVALID>.
Michael,

Indeed, it seems that at compile time the Parquet versions are consistent.
However, the exception happens when one Parquet module calls a method in
another Parquet module: parquet-avro calls builder methods in
parquet-column. I can't imagine how this call could break with consistent
Parquet versions; Parquet wouldn't even build if that were so.
Could you please check the classpath of the failing task? If you're running
Spark on YARN, you can get the logs via yarn logs -applicationId <app ID>;
you can find the classpath somewhere near the beginning of the log file. Are
the Parquet artifact versions consistent there too?

Nandor
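
One way to answer that from inside the failing task itself is to ask the
JVM which jar each Parquet class was actually loaded from. A minimal sketch
using only standard JDK APIs; the two class names are just examples:

    public class WhichJar {
      public static void main(String[] args) throws Exception {
        // Prints the jar each class was loaded from, so an old
        // parquet-column leaking onto the executor classpath next to a
        // 1.11 parquet-avro becomes visible immediately.
        for (String name : new String[] {
            "org.apache.parquet.schema.Types",
            "org.apache.parquet.avro.AvroSchemaConverter"}) {
          Class<?> c = Class.forName(name);
          System.out.println(name + " <- "
              + c.getProtectionDomain().getCodeSource().getLocation());
        }
      }
    }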

On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Gabor,
>
> 1.7.0 was the first version using the org.apache.parquet packages, so
> that's the correct base version for compatibility checks. The exclusions in
> the POM are classes that the Parquet community does not consider public. We
> rely on these checks to highlight binary incompatibilities, and then we
> discuss them on this list or in the dev sync. If the class is internal, we
> add an exclusion for it.
>
> I know you're familiar with this process since we've talked about it
> before. I also know that you'd rather have more strict binary
> compatibility, but until we have someone with the time to do some
> maintenance and build a public API module, I'm afraid that's what we have
> to work with.
>
> Michael,
>
> I hope the context above is helpful and explains why running a binary
> compatibility check tool will find incompatible changes. We allow binary
> incompatible changes to internal classes and modules, like parquet-common.
>
> On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <ga...@apache.org>
> wrote:
>
> > Ryan,
> > I would not trust our compatibility checks (semver) too much. Currently,
> it
> > is configured to compare our current version to 1.7.0. It means anything
> > that is added since 1.7.0 and then broke in a later release won't be
> > caught. In addition, many packages are excluded from the check because of
> > different reasons. For example org/apache/parquet/schema/** is excluded
> so
> > if it would really be an API compatibility issue we certainly wouldn't
> > catch it.
> >
> > Michael,
> > It fails because of a NoSuchMethodError pointing to a method that is
> newly
> > introduced in 1.11. Both the caller and the callee shipped by parquet-mr.
> > So, I'm quite sure it is a classpath issue. It seems that the 1.11
> version
> > of the parquet-column jar is not on the classpath.
> >
> >
> > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <he...@gmail.com> wrote:
> >
> > > The dependency versions are consistent in our artifact
> > >
> > > $ mvn dependency:tree | grep parquet
> > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > [INFO] |     \-
> > > org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > >
> > > The latter error
> > >
> > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost
> > task
> > > 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > java.lang.NoSuchMethodError:
> > >
> >
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > >         at
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > >
> > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > >
> > > $ spark-submit --version
> > > Welcome to
> > >       ____              __
> > >      / __/__  ___ _____/ /__
> > >     _\ \/ _ \/ _ `/ __/  '_/
> > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > >       /_/
> > >
> > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM,
> 1.8.0_191
> > > Branch
> > > Compiled by user  on 2019-08-27T21:21:38Z
> > > Revision
> > > Url
> > > Type --help for more information.
> > >
> > >
> > >
> > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rb...@netflix.com.INVALID>
> > > wrote:
> > > >
> > > > Thanks for looking into it, Nandor. That doesn't sound like a problem
> > > with
> > > > Parquet, but a problem with the test environment since parquet-avro
> > > depends
> > > > on a newer API method.
> > > >
> > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
> > > <nk...@cloudera.com.invalid>
> > > > wrote:
> > > >
> > > >> I'm not sure that this is a binary compatibility issue. The missing
> > > builder
> > > >> method was recently added in 1.11.0 with the introduction of the new
> > > >> logical type API, while the original version (one with a single
> > > >> OriginalType input parameter called before by AvroSchemaConverter)
> of
> > > this
> > > >> method is kept untouched. It seems to me that the Parquet version on
> > > Spark
> > > >> executor mismatch: parquet-avro is on 1.11.0, but parquet-column is
> > > still
> > > >> on an older version.
> > > >>
> > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <he...@gmail.com>
> > > wrote:
> > > >>
> > > >>> Perhaps not strictly necessary to say, but if this particular
> > > >>> compatibility break between 1.10 and 1.11 was intentional, and no
> > other
> > > >>> compatibility breaks are found, I would vote -1 (non-binding) on
> this
> > > RC
> > > >>> such that we might go back and revisit the changes to preserve
> > > >>> compatibility.
> > > >>>
> > > >>> I am not sure there is presently enough motivation in the Spark
> > project
> > > >>> for a release after 2.4.4 and before 3.0 in which to bump the
> Parquet
> > > >>> dependency version to 1.11.x.
> > > >>>
> > > >>>   michael
> > > >>>
> > > >>>
> > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rblue@netflix.com.INVALID
> >
> > > >>> wrote:
> > > >>>>
> > > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs?
> From
> > > the
> > > >>>> stack trace, it looks like this 1.11.0 RC breaks binary
> > compatibility
> > > >> in
> > > >>>> the type builders.
> > > >>>>
> > > >>>> Looks like this should have been caught by the binary
> compatibility
> > > >>> checks.
> > > >>>>
> > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <
> gabor@apache.org>
> > > >>> wrote:
> > > >>>>
> > > >>>>> Hi Michael,
> > > >>>>>
> > > >>>>> Unfortunately, I don't have too much experience on Spark. But if
> > > spark
> > > >>> uses
> > > >>>>> the parquet-mr library in an embedded way (that's how Hive uses
> it)
> > > it
> > > >>> is
> > > >>>>> required to re-build Spark with 1.11 RC parquet-mr.
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>> Gabor
> > > >>>>>
> > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <heuermh@gmail.com
> >
> > > >>> wrote:
> > > >>>>>
> > > >>>>>> It appears a provided scope dependency on spark-sql leaks old
> > > parquet
> > > >>>>>> versions was causing the runtime error below.  After including
> new
> > > >>>>>> parquet-column and parquet-hadoop compile scope dependencies (in
> > > >>> addition
> > > >>>>>> to parquet-avro, which we already have at compile scope), our
> > build
> > > >>>>>> succeeds.
> > > >>>>>>
> > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232 <
> > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232>
> > > >>>>>>
> > > >>>>>> However, when running via spark-submit I run into a similar
> > runtime
> > > >>> error
> > > >>>>>>
> > > >>>>>> Caused by: java.lang.NoSuchMethodError:
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>
> > >
> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > >>>>>>       at org.apache.spark.internal.io
> > > >>>>>>
> > > >>
> > .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > >>>>>>       at org.apache.spark.internal.io
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > >>>>>>       at org.apache.spark.internal.io
> > > >>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > >>>>>>       at org.apache.spark.internal.io
> > > >>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > >>>>>>       at
> > > >>>>>>
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > >>>>>>       at
> > > >>>>>>
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > >>>>>>       at
> > > >>>>>>
> > > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > >>>>>>       at
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Will bumping our library dependency version to 1.11 require a
> new
> > > >>> version
> > > >>>>>> of Spark, built against Parquet 1.11?
> > > >>>>>>
> > > >>>>>> Please accept my apologies if this is heading out-of-scope for
> the
> > > >>>>> Parquet
> > > >>>>>> mailing list.
> > > >>>>>>
> > > >>>>>>  michael
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <heuermh@GMAIL.COM> wrote:
> > > >>>>>>>
> > > >>>>>>> I am willing to do some benchmarking on genomic data at scale but am not
> > > >>>>>>> quite sure what the Spark target version for 1.11.0 might be. Will Parquet
> > > >>>>>>> 1.11.0 be compatible in Spark 2.4.x?
> > > >>>>>>>
> > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > > >>>>>>>
> > > >>>>>>> …
> > > >>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
> > > >>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > >>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > >>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > >>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > >>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > >>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> > > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > >>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > >>>>>>>
> > > >>>>>>> michael
> > > >>>>>>>
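A minimal probe for this failure mode (an illustrative sketch; run it with the same classpath as the failing job, and note that the class name is simply taken from the stack trace above):

    public class ProbeLogicalType {
      public static void main(String[] args) {
        try {
          // LogicalTypeAnnotation only exists in parquet-column 1.11+; on a
          // 1.10.x classpath this throws, reproducing the error above.
          Class<?> c = Class.forName("org.apache.parquet.schema.LogicalTypeAnnotation");
          System.out.println("found in "
              + c.getProtectionDomain().getCodeSource().getLocation());
        } catch (ClassNotFoundException e) {
          System.out.println("parquet-column on the classpath predates 1.11");
        }
      }
    }

If the probe fails while parquet-avro 1.11.0 is present, an older parquet-column is shadowing the new one, which is the mixed-classpath situation discussed further down in the thread.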
> > > >>>>>>>> [earlier replies trimmed; Gabor's and Ryan's messages of Nov 18-20 appear in full at the top of this thread]
> > > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <ga...@cloudera.com.invalid> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi Fokko,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> For the first point. The referenced constructor is private and Iceberg
> > > >>>>>>>>>>>>> uses it via reflection. It is not a breaking change. I think, parquet-mr
> > > >>>>>>>>>>>>> shall not keep private methods only because clients might use them via
> > > >>>>>>>>>>>>> reflection.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> About the checksum. I've agreed on having the CRC checksum write enabled
> > > >>>>>>>>>>>>> by default because the benchmarks did not show significant performance
> > > >>>>>>>>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> About the file size change. 1.11.0 is introducing column indexes, CRC
> > > >>>>>>>>>>>>> checksum, removing the statistics from the page headers and maybe other
> > > >>>>>>>>>>>>> changes that impact file size. If only file size is in question I cannot
> > > >>>>>>>>>>>>> see a breaking change here.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>> Gabor
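The switch Gabor refers to can also be exercised from client code. A minimal sketch, assuming the 1.11.0 parquet-hadoop API staged in this RC (the builder method and the configuration key are the ones PR 647 appears to introduce; verify both against the RC before relying on this):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class ChecksumToggle {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"r\",\"fields\":"
                + "[{\"name\":\"id\",\"type\":\"long\"}]}");

        // Per-writer switch: page-level CRCs are on by default in 1.11.0.
        try (ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/no-crc.parquet"))
                .withSchema(schema)
                .withPageWriteChecksumEnabled(false)
                .build()) {
          // write records here
        }

        // Job-wide switch via Hadoop configuration, passed to the writing job.
        // The key mirrors ParquetOutputFormat's PAGE_WRITE_CHECKSUM_ENABLED
        // constant (assumption: check the exact name against the staged jars).
        Configuration conf = new Configuration();
        conf.setBoolean("parquet.page.write-checksum.enabled", false);
      }
    }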
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - We've broken backward compatibility of the constructor of
> > > >>>>>>>>>>>>>> ColumnChunkPageWriteStore
> > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> > > >>>>>>>>>>>>>> This required a change
> > > >>>>>>>>>>>>>> <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> > > >>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will be a new RC,
> > > >>>>>>>>>>>>>> I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
> > > >>>>>>>>>>>>>> - Related, and something we need to put in the changelog, is that checksums
> > > >>>>>>>>>>>>>> are enabled by default:
> > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > >>>>>>>>>>>>>> This will impact performance. I would suggest disabling it by default:
> > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
> > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> > > >>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed that the
> > > >>>>>>>>>>>>>> split-test was failing:
> > > >>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > >>>>>>>>>>>>>> The two records are now divided over four Spark partitions. Something in
> > > >>>>>>>>>>>>>> the output has changed since the files are bigger now. Has anyone any idea
> > > >>>>>>>>>>>>>> to check what's changed, or a way to check this? The only thing I can
> > > >>>>>>>>>>>>>> think of is the checksum mentioned above.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > >>>>>>>>>>>>>> id = 1
> > > >>>>>>>>>>>>>> data = a
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > >>>>>>>>>>>>>> id = 1
> > > >>>>>>>>>>>>>> data = a
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> A binary diff here: https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Cheers, Fokko
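As for a way to check what changed between the two files, a minimal sketch (paths are illustrative; assumes the 1.11.0 parquet-hadoop footer API) that dumps per-column-chunk sizes from both footers for comparison:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class FooterDiff {
      public static void main(String[] args) throws Exception {
        for (String file : new String[] {
            "parquet-1-10-1.parquet", "parquet-1-11-0.parquet"}) {
          try (ParquetFileReader reader = ParquetFileReader.open(
              HadoopInputFile.fromPath(new Path(file), new Configuration()))) {
            ParquetMetadata footer = reader.getFooter();
            // Print each column chunk's on-disk and logical size.
            footer.getBlocks().forEach(block ->
                block.getColumns().forEach(column ->
                    System.out.println(file + " " + column.getPath()
                        + " compressed=" + column.getTotalSize()
                        + " uncompressed=" + column.getTotalUncompressedSize())));
          }
        }
      }
    }

If the column chunks come out the same size in both files, the extra 49 bytes live in the footer and the new index structures rather than in the data pages.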
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <chenjunjiedada@gmail.com> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> +1
> > > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> wrote on Thu, Nov 14, 2019 at 2:05 PM:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> +1
> > > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <gabor@apache.org> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> [original vote e-mail trimmed; it appears in full at the top of this thread]
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [RESULT] Release Apache Parquet 1.11.0 RC7

Posted by Ismaël Mejía <ie...@gmail.com>.
Thanks a lot Gabor (and the others) for making this happen!

On Fri, Dec 6, 2019 at 5:55 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks for getting this done, Gabor!
>
> On Fri, Dec 6, 2019 at 12:44 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
> > Thanks, Julien and all of you who have voted.
> > With three binding +1 votes and four non-binding +1 votes (no -1 votes)
> > this release passes.
> > I'll finalize the release in the next hour.
> >
> > Cheers,
> > Gabor
> >
> > On Fri, Dec 6, 2019 at 12:12 AM Julien Le Dem <ju...@wework.com.invalid> wrote:
> >
> > > I verified the signatures
> > > ran the build and test
> > > It looks like the compatibility changes being discussed are not blockers.
> > >
> > > +1 (binding)
> > >
> > >
> > > On Wed, Nov 27, 2019 at 1:43 AM Gabor Szadovszky <ga...@apache.org>
> > wrote:
> > >
> > > > Thanks, Zoltan.
> > > >
> > > > I also vote +1 (binding)
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > On Tue, Nov 26, 2019 at 1:46 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > - I have read through the problem reports in this e-mail thread (one caused
> > > > > by the use of a private method via reflection and another one caused by
> > > > > having mixed versions of the libraries on the classpath) and I am convinced
> > > > > that they do not block the release.
> > > > > - Signature and hash of the source tarball are valid.
> > > > > - The specified git hash matches the specified git tag.
> > > > > - The contents of the source tarball match the contents of the git repo at
> > > > > the specified tag.
> > > > >
> > > > > Br,
> > > > >
> > > > > Zoltan
> > > > >
> > > > >
> > > > > On Tue, Nov 26, 2019 at 10:54 AM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > >
> > > > > > Created https://issues.apache.org/jira/browse/PARQUET-1703 to track this.
> > > > > >
> > > > > > Back to the RC. Anyone from the PMC willing to vote?
> > > > > >
> > > > > > Cheers,
> > > > > > Gabor
> > > > > >
> > > > > > On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
> > > > > >
> > > > > > > Gabor, good point about not being able to check new APIs. Updating the
> > > > > > > previous version would also allow us to get rid of temporary exclusions,
> > > > > > > like the one you pointed out for schema. It would be great to improve
> > > > > > > what we catch there.
> > > > > > >
> > > > > > > On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi Ryan,
> > > > > > > >
> > > > > > > > It is a different topic but would like to reflect shortly.
> > > > > > > > I understand that 1.7.0 was the first apache release. The problem with
> > > > > > > > doing the compatibility checks comparing to 1.7.0 is that we can easily
> > > > > > > > add incompatibilities in APIs which were added after 1.7.0. For example:
> > > > > > > > Adding a new class for public use in 1.8.0 then removing it in 1.9.0. The
> > > > > > > > compatibility check would not discover this breaking change. So, I think,
> > > > > > > > a better approach would be to always compare to the previous minor
> > > > > > > > release (e.g. comparing 1.9.0 to 1.8.0 etc.).
> > > > > > > > As I've written before, even org/apache/parquet/schema/** is excluded
> > > > > > > > from the compatibility check. As far as I know this is public API. So, I
> > > > > > > > am not sure that only packages that are not part of the public API are
> > > > > > > > excluded.
> > > > > > > >
> > > > > > > > Let's discuss this on the next parquet sync.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Gabor
> > > > > > > >
> > > > > > > > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
> > > > > > > >
> > > > > > > > > Gabor,
> > > > > > > > >
> > > > > > > > > 1.7.0 was the first version using the org.apache.parquet packages, so
> > > > > > > > > that's the correct base version for compatibility checks. The exclusions
> > > > > > > > > in the POM are classes that the Parquet community does not consider
> > > > > > > > > public. We rely on these checks to highlight binary incompatibilities,
> > > > > > > > > and then we discuss them on this list or in the dev sync. If the class is
> > > > > > > > > internal, we add an exclusion for it.
> > > > > > > > >
> > > > > > > > > I know you're familiar with this process since we've talked about it
> > > > > > > > > before. I also know that you'd rather have more strict binary
> > > > > > > > > compatibility, but until we have someone with the time to do some
> > > > > > > > > maintenance and build a public API module, I'm afraid that's what we have
> > > > > > > > > to work with.
> > > > > > > > >
> > > > > > > > > Michael,
> > > > > > > > >
> > > > > > > > > I hope the context above is helpful and explains why running a binary
> > > > > > > > > compatibility check tool will find incompatible changes. We allow binary
> > > > > > > > > incompatible changes to internal classes and modules, like parquet-common.
> > > > > > > > >
> > > > > > > > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Ryan,
> > > > > > > > > > I would not trust our compatibility checks (semver) too much. Currently,
> > > > > > > > > > it is configured to compare our current version to 1.7.0. It means
> > > > > > > > > > anything that is added since 1.7.0 and then broken in a later release
> > > > > > > > > > won't be caught. In addition, many packages are excluded from the check
> > > > > > > > > > because of different reasons. For example org/apache/parquet/schema/**
> > > > > > > > > > is excluded so if it would really be an API compatibility issue we
> > > > > > > > > > certainly wouldn't catch it.
> > > > > > > > > >
> > > > > > > > > > Michael,
> > > > > > > > > > It fails because of a NoSuchMethodError pointing to a method that is
> > > > > > > > > > newly introduced in 1.11. Both the caller and the callee are shipped by
> > > > > > > > > > parquet-mr. So, I'm quite sure it is a classpath issue. It seems that the
> > > > > > > > > > 1.11 version of the parquet-column jar is not on the classpath.
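A quick way to confirm that diagnosis in the failing JVM, a self-contained sketch (the two classes are simply the caller and callee from the stack trace quoted further down):

    public class WhichParquetJars {
      public static void main(String[] args) {
        // Prints the jar that actually provides each side of the failing call.
        // Different parquet-* versions in the two locations mean a mixed classpath.
        System.out.println(org.apache.parquet.avro.AvroSchemaConverter.class
            .getProtectionDomain().getCodeSource().getLocation());
        System.out.println(org.apache.parquet.schema.Types.class
            .getProtectionDomain().getCodeSource().getLocation());
      }
    }

Under spark-submit the second line will likely point at the parquet-column jar bundled with the Spark distribution, which would match the behaviour reported below.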
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > The dependency versions are consistent in our artifact
> > > > > > > > > > >
> > > > > > > > > > > $ mvn dependency:tree | grep parquet
> > > > > > > > > > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > > > > > > > > [INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > > > > > > > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > > > > > > > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > > > > > > > > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > > > > > > > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > > > > > > > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > > > > > > > > >
> > > > > > > > > > > The latter error
> > > > > > > > > > >
> > > > > > > > > > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > > > > > > > > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task
> > > > > > > > > > > 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > > > > > > > > java.lang.NoSuchMethodError:
> > > > > > > > > > > org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > > > > >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > > > > >
> > > > > > > > > > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > > > > > > > > > >
> > > > > > > > > > > $ spark-submit --version
> > > > > > > > > > > Welcome to
> > > > > > > > > > >       ____              __
> > > > > > > > > > >      / __/__  ___ _____/ /__
> > > > > > > > > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > > > > > > > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > > > > > > > > >       /_/
> > > > > > > > > > >
> > > > > > > > > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
> > > > > > > > > > > Branch
> > > > > > > > > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > > > > > > > > Revision
> > > > > > > > > > > Url
> > > > > > > > > > > Type --help for more information.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for looking into it, Nandor. That doesn't sound like a problem
> > > > > > > > > > > > with Parquet, but a problem with the test environment since parquet-avro
> > > > > > > > > > > > depends on a newer API method.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <nk...@cloudera.com.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> I'm not sure that this is a binary compatibility issue. The missing
> > > > > > > > > > > >> builder method was recently added in 1.11.0 with the introduction of the
> > > > > > > > > > > >> new logical type API, while the original version (one with a single
> > > > > > > > > > > >> OriginalType input parameter called before by AvroSchemaConverter) of
> > > > > > > > > > > >> this method is kept untouched. It seems to me that the Parquet versions
> > > > > > > > > > > >> on the Spark executor mismatch: parquet-avro is on 1.11.0, but
> > > > > > > > > > > >> parquet-column is still on an older version.
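To make the mismatch concrete, a short sketch of the two overloads in question (assuming the 1.11.0 schema API; the OriginalType overload resolves against both versions, while the LogicalTypeAnnotation one needs a 1.11 parquet-column at runtime):

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Type;
    import org.apache.parquet.schema.Types;

    public class AsOverloads {
      public static void main(String[] args) {
        // Pre-1.11 overload, kept untouched: takes the old OriginalType enum.
        Type oldStyle = Types.required(PrimitiveTypeName.BINARY)
            .as(OriginalType.UTF8)
            .named("name");
        // Overload added in 1.11.0: takes a LogicalTypeAnnotation. This is the
        // method the NoSuchMethodError above cannot resolve when an older
        // parquet-column is picked up at runtime.
        Type newStyle = Types.required(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType())
            .named("name");
        System.out.println(oldStyle + " / " + newStyle);
      }
    }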
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >>> Perhaps not strictly necessary to say, but if this particular
> > > > > > > > > > > >>> compatibility break between 1.10 and 1.11 was intentional, and no other
> > > > > > > > > > > >>> compatibility breaks are found, I would vote -1 (non-binding) on this RC
> > > > > > > > > > > >>> such that we might go back and revisit the changes to preserve
> > > > > > > > > > > >>> compatibility.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> I am not sure there is presently enough motivation in the Spark project
> > > > > > > > > > > >>> for a release after 2.4.4 and before 3.0 in which to bump the Parquet
> > > > > > > > > > > >>> dependency version to 1.11.x.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>   michael
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs? From the
> > > > > > > > > > > >>>> stack trace, it looks like this 1.11.0 RC breaks binary compatibility in
> > > > > > > > > > > >>>> the type builders.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Looks like this should have been caught by the binary compatibility checks.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>> Hi Michael,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Unfortunately, I don't have too much experience on Spark. But if Spark
> > > > > > > > > > > >>>>> uses the parquet-mr library in an embedded way (that's how Hive uses it)
> > > > > > > > > > > >>>>> it is required to re-build Spark with 1.11 RC parquet-mr.
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Regards,
> > > > > > > > > > > >>>>> Gabor
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>>> It appears a provided scope dependency on spark-sql leaking old parquet
> > > > > > > > > > > >>>>>> versions was causing the runtime error below. After including new
> > > > > > > > > > > >>>>>> parquet-column and parquet-hadoop compile scope dependencies (in addition
> > > > > > > > > > > >>>>>> to parquet-avro, which we already have at compile scope), our build
> > > > > > > > > > > >>>>>> succeeds.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> However, when running via spark-submit I run into a similar runtime error
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Caused by: java.lang.NoSuchMethodError:
> > > > > > > > > > > >>>>>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > > > > > >>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > > > > > >>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > > > > > >>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > > > > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > > > > > >>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > > > > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Will bumping our library dependency version to 1.11 require a new version
> > > > > > > > > > > >>>>>> of Spark, built against Parquet 1.11?
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Please accept my apologies if this is heading out-of-scope for the Parquet
> > > > > > > > > > > >>>>>> mailing list.
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>>  michael
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>>> [rest of the quoted thread trimmed; it repeats, at deeper quoting levels, the messages already shown earlier in this thread]
> > > > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > > > > > >>>>>>>>>>>>>> id = 1
> > > > > > > > > > > >>>>>>>>>>>>>> data = a
> > > > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > > > >>>>>>>>>>>>>> A binary diff here:
> > > > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > > > >>
> > > > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > > > > > > > >>>>>>>>>>>>>>
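>> One way to compare the two files from code, a minimal sketch using only
>> the public ParquetFileReader API (the paths are the ones listed above):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.parquet.hadoop.ParquetFileReader;
>> import org.apache.parquet.hadoop.metadata.BlockMetaData;
>> import org.apache.parquet.hadoop.util.HadoopInputFile;
>>
>> public class CompareFooters {
>>   public static void main(String[] args) throws Exception {
>>     for (String f : new String[] {
>>         "/Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet",
>>         "/Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet"}) {
>>       try (ParquetFileReader reader = ParquetFileReader.open(
>>           HadoopInputFile.fromPath(new Path(f), new Configuration()))) {
>>         // Per-row-group row counts and byte sizes show whether the extra
>>         // bytes live in the pages or only in the footer and indexes.
>>         for (BlockMetaData block : reader.getFooter().getBlocks()) {
>>           System.out.println(f + ": " + block.getRowCount() + " rows, "
>>               + block.getTotalByteSize() + " bytes");
>>         }
>>       }
>>     }
>>   }
>> }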
>>
>> Cheers, Fokko
>>
>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <chenjunjiedada@gmail.com> wrote:
>>
>>> +1
>>> Verified signature, checksum and ran mvn install successfully.
>>>
>>> On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <yumwang@ebay.com.invalid> wrote:
>>>
>>>> +1
>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2

Re: [RESULT] Release Apache Parquet 1.11.0 RC7

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for getting this done, Gabor!

On Fri, Dec 6, 2019 at 12:44 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Thanks, Julien and all of you who have voted.
> With three binding +1 votes and four non-binding +1 votes (and no -1 votes),
> this release passes.
> I'll finalize the release in the next hour.
>
> Cheers,
> Gabor
>
> On Fri, Dec 6, 2019 at 12:12 AM Julien Le Dem
> <ju...@wework.com.invalid> wrote:
>
> > I verified the signatures,
> > ran the build and tests.
> > It looks like the compatibility changes being discussed are not blockers.
> >
> > +1 (binding)
> >
> >
> > On Wed, Nov 27, 2019 at 1:43 AM Gabor Szadovszky <ga...@apache.org>
> wrote:
> >
> > > Thanks, Zoltan.
> > >
> > > I also vote +1 (binding)
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Tue, Nov 26, 2019 at 1:46 PM Zoltan Ivanfi <zi@cloudera.com.invalid> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > - I have read through the problem reports in this e-mail thread (one
> > > > caused by the use of a private method via reflection and another one
> > > > caused by having mixed versions of the libraries on the classpath) and
> > > > I am convinced that they do not block the release.
> > > > - Signature and hash of the source tarball are valid.
> > > > - The specified git hash matches the specified git tag.
> > > > - The contents of the source tarball match the contents of the git
> > > > repo at the specified tag.
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > >
> > > > On Tue, Nov 26, 2019 at 10:54 AM Gabor Szadovszky <ga...@apache.org>
> > > > wrote:
> > > >
> > > > > Created https://issues.apache.org/jira/browse/PARQUET-1703 to track this.
> > > > >
> > > > > Back to the RC. Anyone from the PMC willing to vote?
> > > > >
> > > > > Cheers,
> > > > > Gabor
> > > > >
> > > > > On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
> > > > >
> > > > > > Gabor, good point about not being able to check new APIs. Updating
> > > > > > the previous version would also allow us to get rid of temporary
> > > > > > exclusions, like the one you pointed out for schema. It would be
> > > > > > great to improve what we catch there.
> > > > > >
> > > > > > On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > >
> > > > > > > Hi Ryan,
> > > > > > >
> > > > > > > It is a different topic but I would like to reflect on it briefly.
> > > > > > > I understand that 1.7.0 was the first Apache release. The problem
> > > > > > > with doing the compatibility checks against 1.7.0 is that we can
> > > > > > > easily add incompatibilities in APIs that were added after 1.7.0.
> > > > > > > For example: adding a new class for public use in 1.8.0, then
> > > > > > > removing it in 1.9.0. The compatibility check would not discover
> > > > > > > this breaking change. So, I think, a better approach would be to
> > > > > > > always compare to the previous minor release (e.g. comparing 1.9.0
> > > > > > > to 1.8.0 etc.).
> > > > > > > As I've written before, even org/apache/parquet/schema/** is
> > > > > > > excluded from the compatibility check. As far as I know this is
> > > > > > > public API. So, I am not sure that only packages that are not part
> > > > > > > of the public API are excluded.
> > > > > > >
> > > > > > > Let's discuss this on the next parquet sync.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Gabor
> > > > > > >
> > > > > > > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
> > > > > > >
> > > > > > > > Gabor,
> > > > > > > >
> > > > > > > > 1.7.0 was the first version using the org.apache.parquet packages,
> > > > > > > > so that's the correct base version for compatibility checks. The
> > > > > > > > exclusions in the POM are classes that the Parquet community does
> > > > > > > > not consider public. We rely on these checks to highlight binary
> > > > > > > > incompatibilities, and then we discuss them on this list or in the
> > > > > > > > dev sync. If the class is internal, we add an exclusion for it.
> > > > > > > >
> > > > > > > > I know you're familiar with this process since we've talked about
> > > > > > > > it before. I also know that you'd rather have more strict binary
> > > > > > > > compatibility, but until we have someone with the time to do some
> > > > > > > > maintenance and build a public API module, I'm afraid that's what
> > > > > > > > we have to work with.
> > > > > > > >
> > > > > > > > Michael,
> > > > > > > >
> > > > > > > > I hope the context above is helpful and explains why running a
> > > > > > > > binary compatibility check tool will find incompatible changes. We
> > > > > > > > allow binary incompatible changes to internal classes and modules,
> > > > > > > > like parquet-common.
> > > > > > > >
> > > > > > > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Ryan,
> > > > > > > > > I would not trust our compatibility checks (semver) too much.
> > > > > > > > > Currently, they are configured to compare our current version to
> > > > > > > > > 1.7.0. That means anything that was added after 1.7.0 and then
> > > > > > > > > broken in a later release won't be caught. In addition, many
> > > > > > > > > packages are excluded from the check for different reasons. For
> > > > > > > > > example org/apache/parquet/schema/** is excluded, so if this
> > > > > > > > > really were an API compatibility issue we certainly wouldn't
> > > > > > > > > catch it.
> > > > > > > > >
> > > > > > > > > Michael,
> > > > > > > > > It fails because of a NoSuchMethodError pointing to a method that
> > > > > > > > > is newly introduced in 1.11. Both the caller and the callee are
> > > > > > > > > shipped by parquet-mr. So, I'm quite sure it is a classpath
> > > > > > > > > issue. It seems that the 1.11 version of the parquet-column jar
> > > > > > > > > is not on the classpath.
> > > > > > > > >
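> > > > > > > > > A quick way to confirm that on the executor is to print where a
> > > > > > > > > parquet-column class was loaded from (a minimal sketch; Types is
> > > > > > > > > just a convenient class that lives in parquet-column):
> > > > > > > > >
> > > > > > > > > import org.apache.parquet.schema.Types;
> > > > > > > > >
> > > > > > > > > public class WhichJar {
> > > > > > > > >   public static void main(String[] args) {
> > > > > > > > >     // Prints the jar that parquet-column classes were actually
> > > > > > > > >     // loaded from; a 1.10.x path here confirms the mismatch.
> > > > > > > > >     System.out.println(Types.class.getProtectionDomain()
> > > > > > > > >         .getCodeSource().getLocation());
> > > > > > > > >   }
> > > > > > > > > }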
> > > > > > > > >
> > > > > > > > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > The dependency versions are consistent in our artifact
> > > > > > > > > >
> > > > > > > > > > $ mvn dependency:tree | grep parquet
> > > > > > > > > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > > > > > > > [INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > > > > > > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > > > > > > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > > > > > > > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > > > > > > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > > > > > > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > > > > > > > >
> > > > > > > > > > The latter error
> > > > > > > > > >
> > > > > > > > > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > > > > > > > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost
> > > > > > > > > > task 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > > > > > > > java.lang.NoSuchMethodError:
> > > > > > > > > > org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > > > >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > > > >
> > > > > > > > > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > > > > > > > > >
> > > > > > > > > > $ spark-submit --version
> > > > > > > > > > Welcome to
> > > > > > > > > >       ____              __
> > > > > > > > > >      / __/__  ___ _____/ /__
> > > > > > > > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > > > > > > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > > > > > > > >       /_/
> > > > > > > > > >
> > > > > > > > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
> > > > > > > > > > Branch
> > > > > > > > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > > > > > > > Revision
> > > > > > > > > > Url
> > > > > > > > > > Type --help for more information.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Thanks for looking into it, Nandor. That doesn't sound like a
> > > > > > > > > > > problem with Parquet, but a problem with the test environment
> > > > > > > > > > > since parquet-avro depends on a newer API method.
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <nk...@cloudera.com.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > >> I'm not sure that this is a binary compatibility issue. The
> > > > > > > > > > >> missing builder method was recently added in 1.11.0 with the
> > > > > > > > > > >> introduction of the new logical type API, while the original
> > > > > > > > > > >> version of this method (the one with a single OriginalType
> > > > > > > > > > >> input parameter, called before by AvroSchemaConverter) is kept
> > > > > > > > > > >> untouched. It seems to me that the Parquet versions on the
> > > > > > > > > > >> Spark executor are mismatched: parquet-avro is on 1.11.0, but
> > > > > > > > > > >> parquet-column is still on an older version.
> > > > > > > > > > >>
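> > > > > > > > > > >> For reference, a minimal sketch of the two overloads in
> > > > > > > > > > >> question (assuming parquet-column 1.11.0 on the classpath; the
> > > > > > > > > > >> column name is illustrative):
> > > > > > > > > > >>
> > > > > > > > > > >> import org.apache.parquet.schema.LogicalTypeAnnotation;
> > > > > > > > > > >> import org.apache.parquet.schema.OriginalType;
> > > > > > > > > > >> import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
> > > > > > > > > > >> import org.apache.parquet.schema.Type;
> > > > > > > > > > >> import org.apache.parquet.schema.Types;
> > > > > > > > > > >>
> > > > > > > > > > >> public class BuilderOverloads {
> > > > > > > > > > >>   public static void main(String[] args) {
> > > > > > > > > > >>     // Pre-1.11 overload, kept untouched: takes an OriginalType.
> > > > > > > > > > >>     Type oldStyle = Types.required(PrimitiveTypeName.BINARY)
> > > > > > > > > > >>         .as(OriginalType.UTF8)
> > > > > > > > > > >>         .named("data");
> > > > > > > > > > >>     // Overload added in 1.11.0: takes a LogicalTypeAnnotation.
> > > > > > > > > > >>     // This is the method the NoSuchMethodError points at when
> > > > > > > > > > >>     // an older parquet-column jar wins on the classpath.
> > > > > > > > > > >>     Type newStyle = Types.required(PrimitiveTypeName.BINARY)
> > > > > > > > > > >>         .as(LogicalTypeAnnotation.stringType())
> > > > > > > > > > >>         .named("data");
> > > > > > > > > > >>     System.out.println(oldStyle + " / " + newStyle);
> > > > > > > > > > >>   }
> > > > > > > > > > >> }
> > > > > > > > > > >>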
> > > > > > > > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >>> Perhaps not strictly necessary to say, but if this particular
> > > > > > > > > > >>> compatibility break between 1.10 and 1.11 was intentional, and
> > > > > > > > > > >>> no other compatibility breaks are found, I would vote -1
> > > > > > > > > > >>> (non-binding) on this RC such that we might go back and
> > > > > > > > > > >>> revisit the changes to preserve compatibility.
> > > > > > > > > > >>>
> > > > > > > > > > >>> I am not sure there is presently enough motivation in the
> > > > > > > > > > >>> Spark project for a release after 2.4.4 and before 3.0 in
> > > > > > > > > > >>> which to bump the Parquet dependency version to 1.11.x.
> > > > > > > > > > >>>
> > > > > > > > > > >>>   michael
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Gabor, shouldn't Parquet be binary compatible for public
> > > > > > > > > > >>>> APIs? From the stack trace, it looks like this 1.11.0 RC
> > > > > > > > > > >>>> breaks binary compatibility in the type builders.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Looks like this should have been caught by the binary
> > > > > > > > > > >>>> compatibility checks.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>> Hi Michael,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Unfortunately, I don't have too much experience with Spark.
> > > > > > > > > > >>>>> But if Spark uses the parquet-mr library in an embedded way
> > > > > > > > > > >>>>> (that's how Hive uses it), you need to re-build Spark with
> > > > > > > > > > >>>>> the 1.11 RC parquet-mr.
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Regards,
> > > > > > > > > > >>>>> Gabor
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>>> It appears that a provided-scope dependency on spark-sql
> > > > > > > > > > >>>>>> leaking old Parquet versions was causing the runtime error
> > > > > > > > > > >>>>>> below. After including new parquet-column and parquet-hadoop
> > > > > > > > > > >>>>>> compile-scope dependencies (in addition to parquet-avro,
> > > > > > > > > > >>>>>> which we already have at compile scope), our build succeeds.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> However, when running via spark-submit I run into a similar
> > > > > > > > > > >>>>>> runtime error
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Caused by: java.lang.NoSuchMethodError:
> > > > > > > > > > >>>>>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > > > > >>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > > > > >>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > > > > >>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > > > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > > > > >>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > > > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Will bumping our library dependency version to 1.11 require
> > > > > > > > > > >>>>>> a new version of Spark, built against Parquet 1.11?
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>> Please accept my apologies if this is heading out-of-scope
> > > > > > > > > > >>>>>> for the Parquet mailing list.
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>  michael
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <heuermh@GMAIL.COM> wrote:
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> I am willing to do some benchmarking on genomic data at
> > > > > > > > > > >>>>>>> scale but am not quite sure what the Spark target version
> > > > > > > > > > >>>>>>> for 1.11.0 might be. Will Parquet 1.11.0 be compatible in
> > > > > > > > > > >>>>>>> Spark 2.4.x?
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> …
> > > > > > > > > > >>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
> > > > > > > > > > >>>>>>> org/apache/parquet/schema/LogicalTypeAnnotation
> > > > > > > > > > >>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > > > > >>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > > > > >>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > > > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > > > > >>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > > > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > > >>>>>>> Caused by: java.lang.ClassNotFoundException:
> > > > > > > > > > >>>>>>> org.apache.parquet.schema.LogicalTypeAnnotation
> > > > > > > > > > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > > > > > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > > > > > > > > >>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > > > > > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>> michael
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <ga...@cloudera.com.invalid> wrote:
> > > > > > > > > > >>>>>>>>
> > > > > > > > > > >>>>>>>>> Hi Fokko,
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>>> For the first point: the referenced constructor is private
> > > > > > > > > > >>>>>>>>> and Iceberg uses it via reflection, so this is not a
> > > > > > > > > > >>>>>>>>> breaking change. I think parquet-mr should not have to
> > > > > > > > > > >>>>>>>>> keep private methods around just because clients might
> > > > > > > > > > >>>>>>>>> use them via reflection.
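> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>>> A minimal sketch of that kind of reflective access (the
> > > > > > > > > > >>>>>>>>> lookup below is illustrative, not Iceberg's exact call site):
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>>> import java.lang.reflect.Constructor;
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>>> public class ReflectiveAccess {
> > > > > > > > > > >>>>>>>>>   public static void main(String[] args) throws Exception {
> > > > > > > > > > >>>>>>>>>     // The class is not public API, so callers outside
> > > > > > > > > > >>>>>>>>>     // org.apache.parquet.hadoop can only reach it by name.
> > > > > > > > > > >>>>>>>>>     Class<?> cls = Class.forName(
> > > > > > > > > > >>>>>>>>>         "org.apache.parquet.hadoop.ColumnChunkPageWriteStore");
> > > > > > > > > > >>>>>>>>>     // Grabbing a declared constructor couples the caller to
> > > > > > > > > > >>>>>>>>>     // a private signature that parquet-mr is free to change
> > > > > > > > > > >>>>>>>>>     // between releases.
> > > > > > > > > > >>>>>>>>>     Constructor<?> ctor = cls.getDeclaredConstructors()[0];
> > > > > > > > > > >>>>>>>>>     ctor.setAccessible(true);
> > > > > > > > > > >>>>>>>>>     System.out.println(ctor);
> > > > > > > > > > >>>>>>>>>   }
> > > > > > > > > > >>>>>>>>> }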
> > > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>> About the checksum: I've agreed on having CRC checksum
> > > > > > > > > > >>>>>>>>> writing enabled by default because the benchmarks did not
> > > > > > > > > > >>>>>>>>> show significant performance penalties. See
> > > > > > > > > > >>>>>>>>> https://github.com/apache/parquet-mr/pull/647 for details.
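> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>>> For jobs that still want it off, a minimal sketch (assuming
> > > > > > > > > > >>>>>>>>> the "parquet.page.write-checksum.enabled" key matches the
> > > > > > > > > > >>>>>>>>> default defined on the ParquetProperties line referenced
> > > > > > > > > > >>>>>>>>> earlier):
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>>> import org.apache.hadoop.conf.Configuration;
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>>> public class DisablePageChecksums {
> > > > > > > > > > >>>>>>>>>   public static void main(String[] args) {
> > > > > > > > > > >>>>>>>>>     // Turns page-level CRC writing off for writers that are
> > > > > > > > > > >>>>>>>>>     // configured from this Hadoop Configuration.
> > > > > > > > > > >>>>>>>>>     Configuration conf = new Configuration();
> > > > > > > > > > >>>>>>>>>     conf.setBoolean("parquet.page.write-checksum.enabled", false);
> > > > > > > > > > >>>>>>>>>     System.out.println(
> > > > > > > > > > >>>>>>>>>         conf.getBoolean("parquet.page.write-checksum.enabled", true));
> > > > > > > > > > >>>>>>>>>   }
> > > > > > > > > > >>>>>>>>> }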
> > > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>> About the file size change: 1.11.0 introduces column
> > > > > > > > > > >>>>>>>>> indexes and CRC checksums, removes the statistics from the
> > > > > > > > > > >>>>>>>>> page headers, and makes other changes that may impact file
> > > > > > > > > > >>>>>>>>> size. If only the file size is in question, I cannot see a
> > > > > > > > > > >>>>>>>>> breaking change here.
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>>> Regards,
> > > > > > > > > > >>>>>>>>> Gabor
> > > > > > > > > > >>>>>>>>>>
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
> > > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>>> This release includes the changes listed
> > at:
> > > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
> > > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>>> Please download, verify, and test.
> > > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>>> Please vote in the next 72 hours.
> > > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet
> > 1.11.0
> > > > > > > > > > >>>>>>>>>>>>>>>> [ ] +0
> > > > > > > > > > >>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
> > > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>> --
> > > > > > > > > > >>>>>>>>>>>> Ryan Blue
> > > > > > > > > > >>>>>>>>>>>> Software Engineer
> > > > > > > > > > >>>>>>>>>>>> Netflix
> > > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>
> > > > > > > > > > >>>>>>>>>>
> > > > > > > > > > >>>>>>>>>> --
> > > > > > > > > > >>>>>>>>>> Ryan Blue
> > > > > > > > > > >>>>>>>>>> Software Engineer
> > > > > > > > > > >>>>>>>>>> Netflix
> > > > > > > > > > >>>>>>>>>>
> > > > > > > > > > >>>>>>>>>
> > > > > > > > > > >>>>>>>
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>>
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> --
> > > > > > > > > > >>>> Ryan Blue
> > > > > > > > > > >>>> Software Engineer
> > > > > > > > > > >>>> Netflix
> > > > > > > > > > >>>
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Ryan Blue
> > > > > > > > > > > Software Engineer
> > > > > > > > > > > Netflix
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Ryan Blue
> > > > > > > > Software Engineer
> > > > > > > > Netflix
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

[RESULT] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Thanks, Julien and all of you who have voted.
With three binding +1 votes and four non-binding +1 votes (no -1 votes),
this release passes.
I'll finalize the release in the next hour.

Cheers,
Gabor

On Fri, Dec 6, 2019 at 12:12 AM Julien Le Dem
<ju...@wework.com.invalid> wrote:

> I verified the signatures
> ran the build and test
> It looks like the compatibility changes being discussed are not blockers.
>
> +1 (binding)
>
>
> On Wed, Nov 27, 2019 at 1:43 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
> > Thanks, Zoltan.
> >
> > I also vote +1 (binding)
> >
> > Cheers,
> > Gabor
> >
> > On Tue, Nov 26, 2019 at 1:46 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > - I have read through the problem reports in this e-mail thread (one
> > > caused by the use of a private method via reflection and another one
> > > caused by having mixed versions of the libraries on the classpath) and
> > > I am convinced that they do not block the release.
> > > - Signature and hash of the source tarball are valid.
> > > - The specified git hash matches the specified git tag.
> > > - The contents of the source tarball match the contents of the git
> > > repo at the specified tag.
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > >
> > > On Tue, Nov 26, 2019 at 10:54 AM Gabor Szadovszky <ga...@apache.org>
> > > wrote:
> > >
> > > > Created https://issues.apache.org/jira/browse/PARQUET-1703 to track
> > > > this.
> > > >
> > > > Back to the RC. Anyone from the PMC willing to vote?
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <rblue@netflix.com.invalid
> >
> > > > wrote:
> > > >
> > > > > Gabor, good point about not being able to check new APIs. Updating
> > the
> > > > > previous version would also allow us to get rid of temporary
> > > exclusions,
> > > > > like the one you pointed out for schema. It would be great to
> improve
> > > > what
> > > > > we catch there.
> > > > >
> > > > > On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <gabor@apache.org
> >
> > > > wrote:
> > > > >
> > > > > > Hi Ryan,
> > > > > >
> > > > > > It is a different topic, but I would like to reflect on it briefly.
> > > > > > I understand that 1.7.0 was the first apache release. The problem
> > > with
> > > > > > doing the compatibility checks comparing to 1.7.0 is that we can
> > > easily
> > > > > add
> > > > > > incompatibilities in API which are added after 1.7.0. For
> example:
> > > > > Adding a
> > > > > > new class for public use in 1.8.0 then removing it in 1.9.0. The
> > > > > > compatibility check would not discover this breaking change. So,
> I
> > > > > think, a
> > > > > > better approach would be to always compare to the previous minor
> > > > release
> > > > > > (e.g. comparing 1.9.0 to 1.8.0 etc.).
> > > > > > As I've written before, even org/apache/parquet/schema/** is
> > excluded
> > > > > from
> > > > > > the compatibility check. As far as I know this is public API.
> So, I
> > > am
> > > > > not
> > > > > > sure that only packages that are not part of the public API are
> > > > excluded.
> > > > > >
> > > > > > Let's discuss this on the next parquet sync.
> > > > > >
> > > > > > Regards,
> > > > > > Gabor
> > > > > >
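
To make the trade-off concrete, here is a minimal sketch of such a
jar-to-jar check, written against japicmp's programmatic API purely as an
illustration (the jar file names and the choice of japicmp are assumptions;
parquet-mr's own build wires its semver check into the POM):

import japicmp.cmp.JApiCmpArchive;
import japicmp.cmp.JarArchiveComparator;
import japicmp.cmp.JarArchiveComparatorOptions;
import japicmp.model.JApiClass;

import java.io.File;
import java.util.List;

public class CompatCheck {
    public static void main(String[] args) {
        JarArchiveComparatorOptions options = new JarArchiveComparatorOptions();
        JarArchiveComparator comparator = new JarArchiveComparator(options);
        // Comparing against the previous minor release (1.10.1) rather than
        // the fixed 1.7.0 base would catch APIs that were added after 1.7.0
        // and broken in a later release:
        List<JApiClass> classes = comparator.compare(
                new JApiCmpArchive(new File("parquet-column-1.10.1.jar"), "1.10.1"),
                new JApiCmpArchive(new File("parquet-column-1.11.0.jar"), "1.11.0"));
        classes.stream()
                .filter(c -> !c.isBinaryCompatible())
                .forEach(c -> System.out.println("Binary incompatible: "
                        + c.getFullyQualifiedName()));
    }
}
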
> > > > > > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue
> > <rblue@netflix.com.invalid
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Gabor,
> > > > > > >
> > > > > > > 1.7.0 was the first version using the org.apache.parquet
> > packages,
> > > so
> > > > > > > that's the correct base version for compatibility checks. The
> > > > > exclusions
> > > > > > in
> > > > > > > the POM are classes that the Parquet community does not
> consider
> > > > > public.
> > > > > > We
> > > > > > > rely on these checks to highlight binary incompatibilities, and
> > > then
> > > > we
> > > > > > > discuss them on this list or in the dev sync. If the class is
> > > > internal,
> > > > > > we
> > > > > > > add an exclusion for it.
> > > > > > >
> > > > > > > I know you're familiar with this process since we've talked
> about
> > > it
> > > > > > > before. I also know that you'd rather have more strict binary
> > > > > > > compatibility, but until we have someone with the time to do
> some
> > > > > > > maintenance and build a public API module, I'm afraid that's
> what
> > > we
> > > > > have
> > > > > > > to work with.
> > > > > > >
> > > > > > > Michael,
> > > > > > >
> > > > > > > I hope the context above is helpful and explains why running a
> > > binary
> > > > > > > compatibility check tool will find incompatible changes. We
> allow
> > > > > binary
> > > > > > > incompatible changes to internal classes and modules, like
> > > > > > parquet-common.
> > > > > > >
> > > > > > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <
> > > gabor@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Ryan,
> > > > > > > > I would not trust our compatibility checks (semver) too much.
> > > > > > Currently,
> > > > > > > it
> > > > > > > > is configured to compare our current version to 1.7.0. It
> means
> > > > > > anything
> > > > > > > > that is added since 1.7.0 and then broke in a later release
> > won't
> > > > be
> > > > > > > > caught. In addition, many packages are excluded from the
> check
> > > > > because
> > > > > > of
> > > > > > > > different reasons. For example org/apache/parquet/schema/**
> is
> > > > > excluded
> > > > > > > so
> > > > > > > > if it would really be an API compatibility issue we certainly
> > > > > wouldn't
> > > > > > > > catch it.
> > > > > > > >
> > > > > > > > Michael,
> > > > > > > > It fails because of a NoSuchMethodError pointing to a method that is
> > > > > > > > newly introduced in 1.11. Both the caller and the callee are shipped
> > > > > > > > by parquet-mr. So, I'm quite sure it is a classpath issue. It seems
> > > > > > > > that the 1.11 version of the parquet-column jar is not on the
> > > > > > > > classpath.
> > > > > > > >
> > > > > > > >
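
A small probe can confirm this diagnosis at runtime. The class and method
names below are taken from the stack trace in this thread; everything else
is an illustrative sketch:

import java.lang.reflect.Method;

public class ParquetClasspathProbe {
    public static void main(String[] args) throws Exception {
        Class<?> builder = Class.forName("org.apache.parquet.schema.Types$Builder");
        Class<?> annotation = Class.forName("org.apache.parquet.schema.LogicalTypeAnnotation");
        // Throws NoSuchMethodException on a pre-1.11 parquet-column jar:
        Method as = builder.getMethod("as", annotation);
        System.out.println("1.11-style builder method found: " + as);
        // Prints which jar the class was loaded from, to spot mixed versions:
        System.out.println(builder.getProtectionDomain().getCodeSource().getLocation());
    }
}
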
> > > > > > > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > The dependency versions are consistent in our artifact
> > > > > > > > >
> > > > > > > > > $ mvn dependency:tree | grep parquet
> > > > > > > > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > > > > > > [INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > > > > > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > > > > > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > > > > > > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > > > > > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > > > > > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > > > > > > >
> > > > > > > > > The latter error
> > > > > > > > >
> > > > > > > > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > > > > > > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost
> > > > > > > > > task 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > > > > > > java.lang.NoSuchMethodError:
> > > > > > > > > org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > > >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > > >
> > > > > > > > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > > > > > > > >
> > > > > > > > > $ spark-submit --version
> > > > > > > > > Welcome to
> > > > > > > > >       ____              __
> > > > > > > > >      / __/__  ___ _____/ /__
> > > > > > > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > > > > > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > > > > > > >       /_/
> > > > > > > > >
> > > > > > > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
> > > > > > > > > Branch
> > > > > > > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > > > > > > Revision
> > > > > > > > > Url
> > > > > > > > > Type --help for more information.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks for looking into it, Nandor. That doesn't sound like a problem
> > > > > > > > > > with Parquet, but a problem with the test environment since
> > > > > > > > > > parquet-avro depends on a newer API method.
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
> > > > > > > > > <nk...@cloudera.com.invalid>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> I'm not sure that this is a binary compatibility issue. The missing
> > > > > > > > > >> builder method was recently added in 1.11.0 with the introduction of
> > > > > > > > > >> the new logical type API, while the original version (the one with a
> > > > > > > > > >> single OriginalType input parameter, called before by
> > > > > > > > > >> AvroSchemaConverter) of this method is kept untouched. It seems to me
> > > > > > > > > >> that the Parquet versions on the Spark executor mismatch: parquet-avro
> > > > > > > > > >> is on 1.11.0, but parquet-column is still on an older version.
> > > > > > > > > >>
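
For context, the builder method in question is the LogicalTypeAnnotation
overload added in 1.11.0. A minimal sketch of the old and new variants (the
schema and field names are made up for illustration):

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class LogicalTypeExample {
    public static void main(String[] args) {
        // Old API, present before 1.11.0 and kept untouched:
        MessageType oldStyle = Types.buildMessage()
                .optional(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("data")
                .named("record");
        // New API, added in 1.11.0; code compiled against it fails with
        // NoSuchMethodError when an older parquet-column jar is loaded:
        MessageType newStyle = Types.buildMessage()
                .optional(PrimitiveTypeName.BINARY).as(LogicalTypeAnnotation.stringType()).named("data")
                .named("record");
        System.out.println(oldStyle);
        System.out.println(newStyle);
    }
}
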
> > > > > > > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Perhaps not strictly necessary to say, but if this particular
> > > > > > > > > >>> compatibility break between 1.10 and 1.11 was intentional, and no
> > > > > > > > > >>> other compatibility breaks are found, I would vote -1 (non-binding)
> > > > > > > > > >>> on this RC such that we might go back and revisit the changes to
> > > > > > > > > >>> preserve compatibility.
> > > > > > > > > >>>
> > > > > > > > > >>> I am not sure there is presently enough motivation in the Spark
> > > > > > > > > >>> project for a release after 2.4.4 and before 3.0 in which to bump
> > > > > > > > > >>> the Parquet dependency version to 1.11.x.
> > > > > > > > > >>>
> > > > > > > > > >>>   michael
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs? From
> > > > > > > > > >>>> the stack trace, it looks like this 1.11.0 RC breaks binary
> > > > > > > > > >>>> compatibility in the type builders.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Looks like this should have been caught by the binary compatibility
> > > > > > > > > >>>> checks.
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>>> Hi Michael,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Unfortunately, I don't have too much experience with Spark. But if
> > > > > > > > > >>>>> Spark uses the parquet-mr library in an embedded way (that's how
> > > > > > > > > >>>>> Hive uses it), it is required to re-build Spark with the 1.11 RC
> > > > > > > > > >>>>> parquet-mr.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Regards,
> > > > > > > > > >>>>> Gabor
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>> It appears a provided-scope dependency on spark-sql leaking old
> > > > > > > > > >>>>>> parquet versions was causing the runtime error below.  After
> > > > > > > > > >>>>>> including new parquet-column and parquet-hadoop compile scope
> > > > > > > > > >>>>>> dependencies (in addition to parquet-avro, which we already have
> > > > > > > > > >>>>>> at compile scope), our build succeeds.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> However, when running via spark-submit I run into a similar
> > > > > > > > > >>>>>> runtime error
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > > > >>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > > > >>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > > > >>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > > > >>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Will bumping our library dependency version to 1.11 require a new
> > > > > > > > > >>>>>> version of Spark, built against Parquet 1.11?
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Please accept my apologies if this is heading out-of-scope for the
> > > > > > > > > >>>>>> Parquet mailing list.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>  michael
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <heuermh@GMAIL.COM> wrote:
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> I am willing to do some benchmarking on genomic data at scale but
> > > > > > > > > >>>>>>> am not quite sure what the Spark target version for 1.11.0 might
> > > > > > > > > >>>>>>> be. Will Parquet 1.11.0 be compatible in Spark 2.4.x?
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> …
> > > > > > > > > >>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
> > > > > > > > > >>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > > > >>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > > > >>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > > > >>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > >>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> > > > > > > > > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > > > > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > > > > > > > >>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > > > > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> michael
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
> > > > > > > > > >>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>> Gabor, do we have an idea of the additional overhead for a non-test
> > > > > > > > > >>>>>>>>>>>> data file? It should be easy to validate that this doesn't
> > > > > > > > > >>>>>>>>>>>> introduce an unreasonable amount of overhead. In some cases, it
> > > > > > > > > >>>>>>>>>>>> should actually be smaller since the column indexes are truncated
> > > > > > > > > >>>>>>>>>>>> and page stats are not.
> > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <ga...@cloudera.com.invalid> wrote:
> > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> Hi Fokko,
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> For the first point. The referenced constructor is private and
> > > > > > > > > >>>>>>>>>>>>> Iceberg uses it via reflection. It is not a breaking change. I
> > > > > > > > > >>>>>>>>>>>>> think parquet-mr shall not keep private methods only because
> > > > > > > > > >>>>>>>>>>>>> clients might use them via reflection.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> About the checksum. I've agreed on having the CRC checksum write
> > > > > > > > > >>>>>>>>>>>>> enabled by default because the benchmarks did not show
> > > > > > > > > >>>>>>>>>>>>> significant performance penalties. See
> > > > > > > > > >>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/647 for details.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> About the file size change. 1.11.0 is introducing column indexes,
> > > > > > > > > >>>>>>>>>>>>> CRC checksums, removing the statistics from the page headers, and
> > > > > > > > > >>>>>>>>>>>>> maybe other changes that impact file size. If only file size is
> > > > > > > > > >>>>>>>>>>>>> in question, I cannot see a breaking change here.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> Regards,
> > > > > > > > > >>>>>>>>>>>>> Gabor
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> - We've broken backward compatibility of the constructor of
> > > > > > > > > >>>>>>>>>>>>>> ColumnChunkPageWriteStore
> > > > > > > > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> > > > > > > > > >>>>>>>>>>>>>> This required a change
> > > > > > > > > >>>>>>>>>>>>>> <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> > > > > > > > > >>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will be a
> > > > > > > > > >>>>>>>>>>>>>> new RC, I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
> > > > > > > > > >>>>>>>>>>>>>> - Related, and something we need to put in the changelog, is that
> > > > > > > > > >>>>>>>>>>>>>> checksums are enabled by default:
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > > > > > > >>>>>>>>>>>>>> This will impact performance. I would suggest disabling it by
> > > > > > > > > >>>>>>>>>>>>>> default: https://github.com/apache/parquet-mr/pull/700
> > > > > > > > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> > > > > > > > > >>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed that
> > > > > > > > > >>>>>>>>>>>>>> the split-test was failing:
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > > > > > >>>>>>>>>>>>>> The two records are now divided over four Spark partitions.
> > > > > > > > > >>>>>>>>>>>>>> Something in the output has changed since the files are bigger
> > > > > > > > > >>>>>>>>>>>>>> now. Has anyone any idea to check what's changed, or a way to
> > > > > > > > > >>>>>>>>>>>>>> check this? The only thing I can think of is the checksum
> > > > > > > > > >>>>>>>>>>>>>> mentioned above.
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > > > > > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > > > >>>>>>>>>>>>>> id = 1
> > > > > > > > > >>>>>>>>>>>>>> data = a
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > > > >>>>>>>>>>>>>> id = 1
> > > > > > > > > >>>>>>>>>>>>>> data = a
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> A binary diff here:
> > > > > > > > > >>>>>>>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> Cheers, Fokko
> > > > > > > > > >>>>>>>>>>>>>>
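
The reflection problem in Fokko's first point can be pictured with a short
sketch; the lookup below is hypothetical and only shows why binding to a
private constructor's exact signature breaks across releases:

import java.lang.reflect.Constructor;

public class PrivateCtorReflection {
    // Hypothetical helper: locate a private constructor by parameter count,
    // the way downstream code sometimes reaches into internal classes.
    public static Object newPageWriteStore(Object... args) throws Exception {
        Class<?> cls = Class.forName("org.apache.parquet.hadoop.ColumnChunkPageWriteStore");
        for (Constructor<?> ctor : cls.getDeclaredConstructors()) {
            if (ctor.getParameterCount() == args.length) {
                ctor.setAccessible(true);
                // Fails with IllegalArgumentException, or is simply never
                // found, once the private constructor gains or loses a
                // parameter in a new release:
                return ctor.newInstance(args);
            }
        }
        throw new NoSuchMethodException("No matching private constructor");
    }
}
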
> > > > > > > > > >>>>>>>>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <chenjunjiedada@gmail.com> wrote:
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> +1
> > > > > > > > > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
> > > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <yu...@ebay.com.invalid> wrote:
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> +1
> > > > > > > > > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
> > > > > > > > > >>>>>>>>>>>>>>>>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Xinli shang <sh...@uber.com.INVALID>.
I ran tests on production data with file sizes of 10K~200M, with CRC
checksum enabled and disabled as a comparison. No significant difference
is seen.

CRC Enabled Write used time:
14371 ms

CRC Disabled Write used time:
14355 ms
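
A minimal sketch of the kind of A/B write used for such a comparison; the
schema, records, and output path are placeholders, and
withPageWriteChecksumEnabled is assumed to be the 1.11.0 builder switch for
the page-level CRC checksums:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class CrcWriteBenchmark {
    static long timedWriteMillis(Schema schema, Iterable<GenericRecord> records,
                                 Path path, boolean crcEnabled) throws Exception {
        long start = System.nanoTime();
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(path)
                .withSchema(schema)
                .withPageWriteChecksumEnabled(crcEnabled)  // true is the 1.11.0 default
                .build()) {
            for (GenericRecord record : records) {
                writer.write(record);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;  // elapsed millis
    }
}
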

On Thu, Dec 5, 2019 at 11:09 AM Julien Le Dem
<ju...@wework.com.invalid> wrote:

> I verified the signatures
> ran the build and test
> It looks like the compatibility changes being discussed are not blockers.
>
> +1 (binding)
>
>
> On Wed, Nov 27, 2019 at 1:43 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
> > Thanks, Zoltan.
> >
> > I also vote +1 (binding)
> >
> > Cheers,
> > Gabor
> >
> > On Tue, Nov 26, 2019 at 1:46 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > - I have read through the problem reports in this e-mail thread (one
> > > caused by the use of a private method via reflection and another one
> > > caused by having mixed versions of the libraries on the classpath) and
> > > I am convinced that they do not block the release.
> > > - Signature and hash of the source tarball are valid.
> > > - The specified git hash matches the specified git tag.
> > > - The contents of the source tarball match the contents of the git
> > > repo at the specified tag.
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > >
> > > On Tue, Nov 26, 2019 at 10:54 AM Gabor Szadovszky <ga...@apache.org>
> > > wrote:
> > >
> > > > Created https://issues.apache.org/jira/browse/PARQUET-1703 to track
> > > > this.
> > > >
> > > > Back to the RC. Anyone from the PMC willing to vote?
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <rblue@netflix.com.invalid
> >
> > > > wrote:
> > > >
> > > > > Gabor, good point about not being able to check new APIs. Updating
> > the
> > > > > previous version would also allow us to get rid of temporary
> > > exclusions,
> > > > > like the one you pointed out for schema. It would be great to
> improve
> > > > what
> > > > > we catch there.
> > > > >
> > > > > On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <gabor@apache.org
> >
> > > > wrote:
> > > > >
> > > > > > Hi Ryan,
> > > > > >
> > > > > > It is a different topic but would like to reflect shortly.
> > > > > > I understand that 1.7.0 was the first apache release. The problem
> > > with
> > > > > > doing the compatibility checks comparing to 1.7.0 is that we can
> > > easily
> > > > > add
> > > > > > incompatibilities in API which are added after 1.7.0. For
> example:
> > > > > Adding a
> > > > > > new class for public use in 1.8.0 then removing it in 1.9.0. The
> > > > > > compatibility check would not discover this breaking change. So,
> I
> > > > > think, a
> > > > > > better approach would be to always compare to the previous minor
> > > > release
> > > > > > (e.g. comparing 1.9.0 to 1.8.0 etc.).
> > > > > > As I've written before, even org/apache/parquet/schema/** is
> > excluded
> > > > > from
> > > > > > the compatibility check. As far as I know this is public API.
> So, I
> > > am
> > > > > not
> > > > > > sure that only packages that are not part of the public API are
> > > > excluded.
> > > > > >
> > > > > > Let's discuss this on the next parquet sync.
> > > > > >
> > > > > > Regards,
> > > > > > Gabor
> > > > > >
> > > > > > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue
> > <rblue@netflix.com.invalid
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Gabor,
> > > > > > >
> > > > > > > 1.7.0 was the first version using the org.apache.parquet
> > packages,
> > > so
> > > > > > > that's the correct base version for compatibility checks. The
> > > > > exclusions
> > > > > > in
> > > > > > > the POM are classes that the Parquet community does not
> consider
> > > > > public.
> > > > > > We
> > > > > > > rely on these checks to highlight binary incompatibilities, and
> > > then
> > > > we
> > > > > > > discuss them on this list or in the dev sync. If the class is
> > > > internal,
> > > > > > we
> > > > > > > add an exclusion for it.
> > > > > > >
> > > > > > > I know you're familiar with this process since we've talked
> about
> > > it
> > > > > > > before. I also know that you'd rather have more strict binary
> > > > > > > compatibility, but until we have someone with the time to do
> some
> > > > > > > maintenance and build a public API module, I'm afraid that's
> what
> > > we
> > > > > have
> > > > > > > to work with.
> > > > > > >
> > > > > > > Michael,
> > > > > > >
> > > > > > > I hope the context above is helpful and explains why running a
> > > binary
> > > > > > > compatibility check tool will find incompatible changes. We
> allow
> > > > > binary
> > > > > > > incompatible changes to internal classes and modules, like
> > > > > > parquet-common.
> > > > > > >
> > > > > > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <
> > > gabor@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Ryan,
> > > > > > > > I would not trust our compatibility checks (semver) too much.
> > > > > > Currently,
> > > > > > > it
> > > > > > > > is configured to compare our current version to 1.7.0. It
> means
> > > > > > anything
> > > > > > > > that is added since 1.7.0 and then broke in a later release
> > won't
> > > > be
> > > > > > > > caught. In addition, many packages are excluded from the
> check
> > > > > because
> > > > > > of
> > > > > > > > different reasons. For example org/apache/parquet/schema/**
> is
> > > > > excluded
> > > > > > > so
> > > > > > > > if it would really be an API compatibility issue we certainly
> > > > > wouldn't
> > > > > > > > catch it.
> > > > > > > >
> > > > > > > > Michael,
> > > > > > > > It fails because of a NoSuchMethodError pointing to a method
> > that
> > > > is
> > > > > > > newly
> > > > > > > > introduced in 1.11. Both the caller and the callee shipped by
> > > > > > parquet-mr.
> > > > > > > > So, I'm quite sure it is a classpath issue. It seems that the
> > > 1.11
> > > > > > > version
> > > > > > > > of the parquet-column jar is not on the classpath.
> > > > > > > >
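> > > > > > > > A quick way to confirm which jar actually serves the class at
> > > > > > > > runtime (a minimal sketch; run it with the exact classpath the
> > > > > > > > executor gets) is to ask the class for its code source:
> > > > > > > >
> > > > > > > >     public class WhichJar {
> > > > > > > >       public static void main(String[] args) throws Exception {
> > > > > > > >         // prints the jar that provides parquet-column classes
> > > > > > > >         Class<?> c = Class.forName("org.apache.parquet.schema.Types");
> > > > > > > >         System.out.println(
> > > > > > > >             c.getProtectionDomain().getCodeSource().getLocation());
> > > > > > > >       }
> > > > > > > >     }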
> > > > > > > >
> > > > > > > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <
> > heuermh@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > The dependency versions are consistent in our artifact
> > > > > > > > >
> > > > > > > > > $ mvn dependency:tree | grep parquet
> > > > > > > > > [INFO] |  \-
> > org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > > > > > > [INFO] |     \-
> > > > > > > > >
> > org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > > > > > > [INFO] |  +-
> > > org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > > > > > > [INFO] |  |  +-
> > > > > org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > > > > > > [INFO] |  |  \-
> > > > > > org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > > > > > > [INFO] |  +-
> > > org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > > > > > > [INFO] |  |  +-
> > > > > org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > > > > > > >
> > > > > > > > > The latter error
> > > > > > > > >
> > > > > > > > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > > > > > > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost
> > > > > > > > > task 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > > > > > > java.lang.NoSuchMethodError:
> > > > > > > > > org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > > >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > > >
> > > > > > > > > occurs when I attempt to run via spark-submit on Spark
> 2.4.4
> > > > > > > > >
> > > > > > > > > $ spark-submit --version
> > > > > > > > > Welcome to
> > > > > > > > >       ____              __
> > > > > > > > >      / __/__  ___ _____/ /__
> > > > > > > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > > > > > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > > > > > > >       /_/
> > > > > > > > >
> > > > > > > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server
> > VM,
> > > > > > > 1.8.0_191
> > > > > > > > > Branch
> > > > > > > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > > > > > > Revision
> > > > > > > > > Url
> > > > > > > > > Type --help for more information.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue
> > > > <rblue@netflix.com.INVALID
> > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks for looking into it, Nandor. That doesn't sound
> > like a
> > > > > > problem
> > > > > > > > > with
> > > > > > > > > > Parquet, but a problem with the test environment since
> > > > > parquet-avro
> > > > > > > > > depends
> > > > > > > > > > on a newer API method.
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
> > > > > > > > > <nk...@cloudera.com.invalid>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> I'm not sure that this is a binary compatibility issue.
> > The
> > > > > > missing
> > > > > > > > > builder
> > > > > > > > > >> method was recently added in 1.11.0 with the
> introduction
> > of
> > > > the
> > > > > > new
> > > > > > > > > >> logical type API, while the original version (one with a
> > > > single
> > > > > > > > > >> OriginalType input parameter called before by
> > > > > AvroSchemaConverter)
> > > > > > > of
> > > > > > > > > this
> > > > > > > > > >> method is kept untouched. It seems to me that the
> Parquet
> > > > > version
> > > > > > on
> > > > > > > > > Spark
> > > > > > > > > >> executor mismatch: parquet-avro is on 1.11.0, but
> > > > parquet-column
> > > > > > is
> > > > > > > > > still
> > > > > > > > > >> on an older version.
> > > > > > > > > >>
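> > > > > > > > > >> To make the two overloads concrete, a minimal sketch of the
> > > > > > > > > >> builder calls (the field name "s" is just a placeholder):
> > > > > > > > > >>
> > > > > > > > > >>     import org.apache.parquet.schema.*;
> > > > > > > > > >>     import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
> > > > > > > > > >>     import static org.apache.parquet.schema.Type.Repetition.REQUIRED;
> > > > > > > > > >>
> > > > > > > > > >>     class LogicalTypeOverloads {
> > > > > > > > > >>       // pre-1.11 overload, kept untouched in 1.11.0
> > > > > > > > > >>       static final Type OLD = Types.primitive(BINARY, REQUIRED)
> > > > > > > > > >>           .as(OriginalType.UTF8).named("s");
> > > > > > > > > >>       // overload added in 1.11.0; parquet-avro 1.11.0 calls this
> > > > > > > > > >>       // one, so an older parquet-column throws NoSuchMethodError
> > > > > > > > > >>       static final Type ADDED = Types.primitive(BINARY, REQUIRED)
> > > > > > > > > >>           .as(LogicalTypeAnnotation.stringType()).named("s");
> > > > > > > > > >>     }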
> > > > > > > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <
> > > > > heuermh@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Perhaps not strictly necessary to say, but if this
> > > particular
> > > > > > > > > >>> compatibility break between 1.10 and 1.11 was
> > intentional,
> > > > and
> > > > > no
> > > > > > > > other
> > > > > > > > > >>> compatibility breaks are found, I would vote -1
> > > (non-binding)
> > > > > on
> > > > > > > this
> > > > > > > > > RC
> > > > > > > > > >>> such that we might go back and revisit the changes to
> > > > preserve
> > > > > > > > > >>> compatibility.
> > > > > > > > > >>>
> > > > > > > > > >>> I am not sure there is presently enough motivation in
> the
> > > > Spark
> > > > > > > > project
> > > > > > > > > >>> for a release after 2.4.4 and before 3.0 in which to
> bump
> > > the
> > > > > > > Parquet
> > > > > > > > > >>> dependency version to 1.11.x.
> > > > > > > > > >>>
> > > > > > > > > >>>   michael
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue
> > > > > > <rblue@netflix.com.INVALID
> > > > > > > >
> > > > > > > > > >>> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> Gabor, shouldn't Parquet be binary compatible for
> public
> > > > APIs?
> > > > > > > From
> > > > > > > > > the
> > > > > > > > > >>>> stack trace, it looks like this 1.11.0 RC breaks
> binary
> > > > > > > > compatibility
> > > > > > > > > >> in
> > > > > > > > > >>>> the type builders.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Looks like this should have been caught by the binary
> > > > > > > compatibility
> > > > > > > > > >>> checks.
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <
> > > > > > > gabor@apache.org>
> > > > > > > > > >>> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>>> Hi Michael,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Unfortunately, I don't have too much experience on
> > Spark.
> > > > But
> > > > > > if
> > > > > > > > > spark
> > > > > > > > > >>> uses
> > > > > > > > > >>>>> the parquet-mr library in an embedded way (that's how
> > > Hive
> > > > > uses
> > > > > > > it)
> > > > > > > > > it
> > > > > > > > > >>> is
> > > > > > > > > >>>>> required to re-build Spark with 1.11 RC parquet-mr.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Regards,
> > > > > > > > > >>>>> Gabor
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <
> > > > > > heuermh@gmail.com
> > > > > > > >
> > > > > > > > > >>> wrote:
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>> It appears a provided-scope dependency on spark-sql
> > > > > > > > > >>>>>> that leaks old parquet
> > > > > > > > > >>>>>> versions was causing the runtime error below.  After
> > > > > including
> > > > > > > new
> > > > > > > > > >>>>>> parquet-column and parquet-hadoop compile scope
> > > > dependencies
> > > > > > (in
> > > > > > > > > >>> addition
> > > > > > > > > >>>>>> to parquet-avro, which we already have at compile
> > > scope),
> > > > > our
> > > > > > > > build
> > > > > > > > > >>>>>> succeeds.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> However, when running via spark-submit I run into a
> > > > similar
> > > > > > > > runtime
> > > > > > > > > >>> error
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Caused by: java.lang.NoSuchMethodError:
> > > > > > > > > >>>>>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > > > > > > > >>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > > > >>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > > > >>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > > > >>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > > > >>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Will bumping our library dependency version to 1.11
> > > > require
> > > > > a
> > > > > > > new
> > > > > > > > > >>> version
> > > > > > > > > >>>>>> of Spark, built against Parquet 1.11?
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Please accept my apologies if this is heading
> > > out-of-scope
> > > > > for
> > > > > > > the
> > > > > > > > > >>>>> Parquet
> > > > > > > > > >>>>>> mailing list.
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>  michael
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <
> > > > > > heuermh@GMAIL.COM
> > > > > > > >
> > > > > > > > > >>> wrote:
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> I am willing to do some benchmarking on genomic
> data
> > at
> > > > > scale
> > > > > > > but
> > > > > > > > > am
> > > > > > > > > >>>>> not
> > > > > > > > > >>>>>> quite sure what the Spark target version for 1.11.0
> > > might
> > > > > be.
> > > > > > > > Will
> > > > > > > > > >>>>> Parquet
> > > > > > > > > >>>>>> 1.11.0 be compatible in Spark 2.4.x?
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in
> > our
> > > > > build
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> …
> > > > > > > > > >>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
> > > > > > > > > >>>>>>> org/apache/parquet/schema/LogicalTypeAnnotation
> > > > > > > > > >>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > > > >>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > > > >>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > > > >>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > >>>>>>> Caused by: java.lang.ClassNotFoundException:
> > > > > > > > > >>>>>>> org.apache.parquet.schema.LogicalTypeAnnotation
> > > > > > > > > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > > > > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > > > > > > > >>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > > > > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> michael
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <
> > > > > > > gabor@apache.org
> > > > > > > > >
> > > > > > > > > >>>>> wrote:
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> Thanks, Fokko.
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> Ryan, we did not do such measurements yet. I'm
> > > afraid, I
> > > > > > won't
> > > > > > > > > have
> > > > > > > > > >>>>>> enough
> > > > > > > > > >>>>>>>> time to do that in the next couple of weeks.
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> Cheers,
> > > > > > > > > >>>>>>>> Gabor
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko
> > > > > > > > > >>>>> <fokko@driesprong.frl
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>> wrote:
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>>> Thanks Gabor for the explanation. I'd like to
> > change
> > > my
> > > > > > vote
> > > > > > > to
> > > > > > > > > +1
> > > > > > > > > >>>>>>>>> (non-binding).
> > > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>> Cheers, Fokko
> > > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>> Op di 19 nov. 2019 om 18:03 schreef Ryan Blue
> > > > > > > > > >>>>>> <rb...@netflix.com.invalid>
> > > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>>> Gabor, what I meant was: have we tried this with
> > > real
> > > > > data
> > > > > > > to
> > > > > > > > > see
> > > > > > > > > >>>>> the
> > > > > > > > > >>>>>>>>>> effect? I think those results would be helpful.
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor
> Szadovszky
> > <
> > > > > > > > > >>> gabor@apache.org
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>>>>>> wrote:
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>>> Hi Ryan,
> > > > > > > > > >>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>> It is not easy to calculate. For the column
> > indexes
> > > > > > feature
> > > > > > > > we
> > > > > > > > > >>>>>>>>> introduced
> > > > > > > > > >>>>>>>>>>> two new structures saved before the footer:
> > column
> > > > > > indexes
> > > > > > > > and
> > > > > > > > > >>>>> offset
> > > > > > > > > >>>>>>>>>>> indexes. If the min/max values are not too
> long,
> > > then
> > > > > the
> > > > > > > > > >>>>> truncation
> > > > > > > > > >>>>>>>>>> might
> > > > > > > > > >>>>>>>>>>> not decrease the file size because of the
> offset
> > > > > indexes.
> > > > > > > > > >>> Moreover,
> > > > > > > > > >>>>>> we
> > > > > > > > > >>>>>>>>>> also
> > > > > > > > > >>>>>>>>>>> introduced parquet.page.row.count.limit which
> > might
> > > > > > > increase
> > > > > > > > > the
> > > > > > > > > >>>>>> number
> > > > > > > > > >>>>>>>>>> of
> > > > > > > > > >>>>>>>>>>> pages which leads to increasing the file size.
> > > > > > > > > >>>>>>>>>>> The footer itself is also changed and we are
> > saving
> > > > > more
> > > > > > > > values
> > > > > > > > > >> in
> > > > > > > > > >>>>>> it:
> > > > > > > > > >>>>>>>>>> the
> > > > > > > > > >>>>>>>>>>> offset values to the column/offset indexes, the
> > new
> > > > > > logical
> > > > > > > > > type
> > > > > > > > > >>>>>>>>>>> structures, the CRC checksums (we might have
> some
> > > > > > others).
> > > > > > > > > >>>>>>>>>>> So, the size of the files with small amount of
> > data
> > > > > will
> > > > > > be
> > > > > > > > > >>>>> increased
> > > > > > > > > >>>>>>>>>>> (because of the larger footer). The size of the
> > > files
> > > > > > where
> > > > > > > > the
> > > > > > > > > >>>>>> values
> > > > > > > > > >>>>>>>>>> can
> > > > > > > > > >>>>>>>>>>> be encoded very well (RLE) will probably be
> > > increased
> > > > > > > > (because
> > > > > > > > > >> we
> > > > > > > > > >>>>>> will
> > > > > > > > > >>>>>>>>>> have
> > > > > > > > > >>>>>>>>>>> more pages). The size of some files where the
> > > values
> > > > > are
> > > > > > > long
> > > > > > > > > >>>>>> (>64bytes
> > > > > > > > > >>>>>>>>>> by
> > > > > > > > > >>>>>>>>>>> default) might be decreased because of
> truncating
> > > the
> > > > > > > min/max
> > > > > > > > > >>>>> values.
> > > > > > > > > >>>>>>>>>>>
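> > > > > > > > > >>>>>>>>>>> For experiments, both knobs can be set on the write
> > > > > > > > > >>>>>>>>>>> config (a minimal sketch; the keys are assumed to be
> > > > > > > > > >>>>>>>>>>> the 1.11.0 property names, the values are examples):
> > > > > > > > > >>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
> > > > > > > > > >>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>     class IndexTuning {
> > > > > > > > > >>>>>>>>>>>       static Configuration tuned() {
> > > > > > > > > >>>>>>>>>>>         Configuration conf = new Configuration();
> > > > > > > > > >>>>>>>>>>>         // more rows allowed per page -> fewer pages
> > > > > > > > > >>>>>>>>>>>         conf.setInt("parquet.page.row.count.limit", 20000);
> > > > > > > > > >>>>>>>>>>>         // truncate column index min/max beyond this length
> > > > > > > > > >>>>>>>>>>>         conf.setInt("parquet.columnindex.truncate.length", 64);
> > > > > > > > > >>>>>>>>>>>         return conf;
> > > > > > > > > >>>>>>>>>>>       }
> > > > > > > > > >>>>>>>>>>>     }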
> > > > > > > > > >>>>>>>>>>> Regards,
> > > > > > > > > >>>>>>>>>>> Gabor
> > > > > > > > > >>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue
> > > > > > > > > >>>>> <rblue@netflix.com.invalid
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>>>>> wrote:
> > > > > > > > > >>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>> Gabor, do we have an idea of the additional
> > > overhead
> > > > > > for a
> > > > > > > > > >>>>> non-test
> > > > > > > > > >>>>>>>>>> data
> > > > > > > > > >>>>>>>>>>>> file? It should be easy to validate that this
> > > > doesn't
> > > > > > > > > introduce
> > > > > > > > > >>> an
> > > > > > > > > >>>>>>>>>>>> unreasonable amount of overhead. In some
> cases,
> > it
> > > > > > should
> > > > > > > > > >>> actually
> > > > > > > > > >>>>>> be
> > > > > > > > > >>>>>>>>>>>> smaller since the column indexes are truncated
> > and
> > > > > page
> > > > > > > > stats
> > > > > > > > > >> are
> > > > > > > > > >>>>>>>>> not.
> > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor
> Szadovszky
> > > > > > > > > >>>>>>>>>>>> <ga...@cloudera.com.invalid>
> wrote:
> > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> Hi Fokko,
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> For the first point. The referenced
> constructor
> > > is
> > > > > > > private
> > > > > > > > > and
> > > > > > > > > >>>>>>>>>> Iceberg
> > > > > > > > > >>>>>>>>>>>> uses
> > > > > > > > > >>>>>>>>>>>>> it via reflection. It is not a breaking
> > change. I
> > > > > > think,
> > > > > > > > > >>>>> parquet-mr
> > > > > > > > > >>>>>>>>>>> shall
> > > > > > > > > >>>>>>>>>>>>> not keep private methods only because of
> > clients
> > > > > might
> > > > > > > use
> > > > > > > > > >> them
> > > > > > > > > >>>>> via
> > > > > > > > > >>>>>>>>>>>>> reflection.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> About the checksum. I've agreed on having the
> > CRC
> > > > > > > checksum
> > > > > > > > > >> write
> > > > > > > > > >>>>>>>>>>> enabled
> > > > > > > > > >>>>>>>>>>>> by
> > > > > > > > > >>>>>>>>>>>>> default because the benchmarks did not show
> > > > > significant
> > > > > > > > > >>>>> performance
> > > > > > > > > >>>>>>>>>>>>> penalties. See
> > > > > > > > > >>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/647 for details.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> About the file size change. 1.11.0 is
> > introducing
> > > > > > column
> > > > > > > > > >>> indexes,
> > > > > > > > > >>>>>>>>> CRC
> > > > > > > > > >>>>>>>>>>>>> checksum, removing the statistics from the
> page
> > > > > headers
> > > > > > > and
> > > > > > > > > >>> maybe
> > > > > > > > > >>>>>>>>>> other
> > > > > > > > > >>>>>>>>>>>>> changes that impact file size. If only file
> > size
> > > is
> > > > > in
> > > > > > > > > >> question
> > > > > > > > > >>> I
> > > > > > > > > >>>>>>>>>>> cannot
> > > > > > > > > >>>>>>>>>>>>> see a breaking change here.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> Regards,
> > > > > > > > > >>>>>>>>>>>>> Gabor
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong,
> > Fokko
> > > > > > > > > >>>>>>>>>> <fokko@driesprong.frl
> > > > > > > > > >>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> wrote:
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> Unfortunately, a -1 from my side
> (non-binding)
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and
> > > found
> > > > > > three
> > > > > > > > > >> things:
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> - We've broken backward compatibility of the
> > > > > > constructor
> > > > > > > > of
> > > > > > > > > >>>>>>>>>>>>>> ColumnChunkPageWriteStore
> > > > > > > > > >>>>>>>>>>>>>> <
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > > > > > > > > >>>>>>>>>>>>>> >.
> > > > > > > > > >>>>>>>>>>>>>> This required a change
> > > > > > > > > >>>>>>>>>>>>>> <
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > > > > > > > > >>>>>>>>>>>>>> >
> > > > > > > > > >>>>>>>>>>>>>> to the code. This isn't a hard blocker, but
> if
> > > > there
> > > > > > > will
> > > > > > > > be
> > > > > > > > > >> a
> > > > > > > > > >>>>>>>>>> new
> > > > > > > > > >>>>>>>>>>>> RC,
> > > > > > > > > >>>>>>>>>>>>>> I've
> > > > > > > > > >>>>>>>>>>>>>> submitted a patch:
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
> > > > > > > > > >>>>>>>>>>>>>> - Related, that we need to put in the
> > changelog,
> > > > is
> > > > > > that
> > > > > > > > > >>>>>>>>>> checksums
> > > > > > > > > >>>>>>>>>>>> are
> > > > > > > > > >>>>>>>>>>>>>> enabled by default:
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > > > > > > >>>>>>>>>>>>>> This
> > > > > > > > > >>>>>>>>>>>>>> will impact performance. I would suggest
> > > disabling
> > > > > it
> > > > > > by
> > > > > > > > > >>>>>>>>>> default:
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
> > > > > > > > > >>>>>>>>>>>>>> <
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > > > > > > > > >>>>>>>>>>>>>> >
> > > > > > > > > >>>>>>>>>>>>>> - Binary compatibility. While updating
> > Iceberg,
> > > > I've
> > > > > > > > noticed
> > > > > > > > > >>>>>>>>>> that
> > > > > > > > > >>>>>>>>>>>> the
> > > > > > > > > >>>>>>>>>>>>>> split-test was failing:
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > > > > > >>>>>>>>>>>>>> The
> > > > > > > > > >>>>>>>>>>>>>> two records are now divided over four Spark
> > > > > > partitions.
> > > > > > > > > >>>>>>>>>> Something
> > > > > > > > > >>>>>>>>>>> in
> > > > > > > > > >>>>>>>>>>>>> the
> > > > > > > > > >>>>>>>>>>>>>> output has changed since the files are
> bigger
> > > now.
> > > > > Has
> > > > > > > > > anyone
> > > > > > > > > >>>>>>>>>> any
> > > > > > > > > >>>>>>>>>>>> idea
> > > > > > > > > >>>>>>>>>>>>>> to
> > > > > > > > > >>>>>>>>>>>>>> check what's changed, or a way to check
> this?
> > > The
> > > > > only
> > > > > > > > thing
> > > > > > > > > >> I
> > > > > > > > > >>>>>>>>>> can
> > > > > > > > > >>>>>>>>>>>>>> think of
> > > > > > > > > >>>>>>>>>>>>>> is the checksum mentioned above.
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > > > > > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B
> 17
> > > nov
> > > > > > 21:09
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B
> 17
> > > nov
> > > > > > 21:05
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> $ parquet-tools cat
> > > > > > > > > >>>>>>>>>>>>
> > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > > > >>>>>>>>>>>>>> id = 1
> > > > > > > > > >>>>>>>>>>>>>> data = a
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> $ parquet-tools cat
> > > > > > > > > >>>>>>>>>>>>
> > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > > > >>>>>>>>>>>>>> id = 1
> > > > > > > > > >>>>>>>>>>>>>> data = a
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> A binary diff here:
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > > > > > >>>>>>>>>>>>>>
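> > > > > > > > > >>>>>>>>>>>>>> One way to see where the extra bytes live is to dump
> > > > > > > > > >>>>>>>>>>>>>> both footers and diff them (a minimal sketch, assuming
> > > > > > > > > >>>>>>>>>>>>>> the 1.11.0 reader API; args[0] is the file path):
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
> > > > > > > > > >>>>>>>>>>>>>>     import org.apache.hadoop.fs.Path;
> > > > > > > > > >>>>>>>>>>>>>>     import org.apache.parquet.hadoop.ParquetFileReader;
> > > > > > > > > >>>>>>>>>>>>>>     import org.apache.parquet.hadoop.metadata.ParquetMetadata;
> > > > > > > > > >>>>>>>>>>>>>>     import org.apache.parquet.hadoop.util.HadoopInputFile;
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>     public class FooterDump {
> > > > > > > > > >>>>>>>>>>>>>>       public static void main(String[] args) throws Exception {
> > > > > > > > > >>>>>>>>>>>>>>         try (ParquetFileReader reader = ParquetFileReader.open(
> > > > > > > > > >>>>>>>>>>>>>>             HadoopInputFile.fromPath(new Path(args[0]),
> > > > > > > > > >>>>>>>>>>>>>>                 new Configuration()))) {
> > > > > > > > > >>>>>>>>>>>>>>           // footer metadata as JSON, diffable across versions
> > > > > > > > > >>>>>>>>>>>>>>           System.out.println(
> > > > > > > > > >>>>>>>>>>>>>>               ParquetMetadata.toPrettyJSON(reader.getFooter()));
> > > > > > > > > >>>>>>>>>>>>>>         }
> > > > > > > > > >>>>>>>>>>>>>>       }
> > > > > > > > > >>>>>>>>>>>>>>     }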
> > > > > > > > > >>>>>>>>>>>>>> Cheers, Fokko
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> Op za 16 nov. 2019 om 04:18 schreef Junjie
> > Chen
> > > <
> > > > > > > > > >>>>>>>>>>>>> chenjunjiedada@gmail.com
> > > > > > > > > >>>>>>>>>>>>>>> :
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> +1
> > > > > > > > > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn
> > > install
> > > > > > > > > >> successfully.
> > > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid>
> > > > > > > > > >>>>>>>>>>>>>>> wrote on Thursday, Nov 14, 2019 at 2:05 PM:
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> +1
> > > > > > > > > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL
> module:
> > > > > > build/sbt
> > > > > > > > > >>>>>>>>>>>>> "sql/test-only"
> > > > > > > > > >>>>>>>>>>>>>>> -Phadoop-3.2
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky"
> <
> > > > > > > > > >> gabor@apache.org>
> > > > > > > > > >>>>>>>>>>>> wrote:
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> Hi everyone,
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> I propose the following RC to be released
> as
> > > > > > official
> > > > > > > > > >>>>>>>>>> Apache
> > > > > > > > > >>>>>>>>>>>>>> Parquet
> > > > > > > > > >>>>>>>>>>>>>>> 1.11.0
> > > > > > > > > >>>>>>>>>>>>>>>> release.
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> The commit id is
> > > > > > > > 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > > > > >>>>>>>>>>>>>>>> * This corresponds to the tag:
> > > > > > > apache-parquet-1.11.0-rc7
> > > > > > > > > >>>>>>>>>>>>>>>> *
> > > > > > > > > >>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> The release tarball, signature, and
> > checksums
> > > > are
> > > > > > > here:
> > > > > > > > > >>>>>>>>>>>>>>>> *
> > > > > > > > > >>>>>>>>>>>>>>>> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> You can find the KEYS file here:
> > > > > > > > > >>>>>>>>>>>>>>>> *
> > > > > > > > > >>>>>>>>>>>>>>>> https://apache.org/dist/parquet/KEYS
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
> > > > > > > > > >>>>>>>>>>>>>>>> *
> > > > > > > > > >>>>>>>>>>>>>>>> https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> This release includes the changes listed
> at:
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> Please download, verify, and test.
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> Please vote in the next 72 hours.
> > > > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet
> 1.11.0
> > > > > > > > > >>>>>>>>>>>>>>>> [ ] +0
> > > > > > > > > >>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
> > > > > > > > > >>>>>>>>>>>>>>>>


-- 
Xinli Shang

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Julien Le Dem <ju...@wework.com.INVALID>.
I verified the signatures
ran the build and test
It looks like the compatibility changes being discussed are not blockers.

+1 (binding)


On Wed, Nov 27, 2019 at 1:43 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Thanks, Zoltan.
>
> I also vote +1 (binding)
>
> Cheers,
> Gabor
>
> On Tue, Nov 26, 2019 at 1:46 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > +1 (binding)
> >
> > - I have read through the problem reports in this e-mail thread (one
> caused
> > by the use of a private method via reflection and another one caused by
> > having mixed versions of the libraries on the classpath) and I am
> convinced
> > that they do not block the release.
> > - Signature and hash of the source tarball are valid.
> > - The specified git hash matches the specified git tag.
> > - The contents of the source tarball match the contents of the git repo
> at
> > the specified tag.
> >
> > Br,
> >
> > Zoltan
> >
> >
> > On Tue, Nov 26, 2019 at 10:54 AM Gabor Szadovszky <ga...@apache.org>
> > wrote:
> >
> > > Created https://issues.apache.org/jira/browse/PARQUET-1703 to track
> > this.
> > >
> > > Back to the RC. Anyone from the PMC willing to vote?
> > >
> > > Cheers,
> > > Gabor
> > >
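Zoltan's hash check above is easy to script. The sketch below covers only the
checksum step; the artifact file names are illustrative, and it assumes the
published .sha512 file begins with the hex digest (verify the format before
relying on it). The signature check still requires gpg with the KEYS file
imported.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    public class VerifySha512 {
      public static void main(String[] args) throws Exception {
        // Illustrative file names; substitute the actual RC artifacts.
        String tarball = "apache-parquet-1.11.0.tar.gz";

        // Hash the downloaded source tarball with SHA-512.
        byte[] data = Files.readAllBytes(Paths.get(tarball));
        byte[] digest = MessageDigest.getInstance("SHA-512").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }

        // Compare against the published checksum file.
        String published = new String(
            Files.readAllBytes(Paths.get(tarball + ".sha512")),
            StandardCharsets.UTF_8).trim();
        System.out.println(published.startsWith(hex.toString())
            ? "hash OK" : "hash MISMATCH");
      }
    }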
> On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> Gabor, good point about not being able to check new APIs. Updating the
> previous version would also allow us to get rid of temporary exclusions,
> like the one you pointed out for schema. It would be great to improve what
> we catch there.
>
> On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
> Hi Ryan,
>
> It is a different topic, but I would like to reflect on it shortly.
> I understand that 1.7.0 was the first Apache release. The problem with
> doing the compatibility checks against 1.7.0 is that we can easily add
> incompatibilities in APIs which were added after 1.7.0. For example:
> adding a new class for public use in 1.8.0, then removing it in 1.9.0. The
> compatibility check would not discover this breaking change. So, I think, a
> better approach would be to always compare to the previous minor release
> (e.g. comparing 1.9.0 to 1.8.0, etc.).
> As I've written before, even org/apache/parquet/schema/** is excluded from
> the compatibility check. As far as I know, this is public API. So, I am not
> sure that only packages that are not part of the public API are excluded.
>
> Let's discuss this on the next parquet sync.
>
> Regards,
> Gabor
>
> On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
>
> Gabor,
>
> 1.7.0 was the first version using the org.apache.parquet packages, so
> that's the correct base version for compatibility checks. The exclusions in
> the POM are classes that the Parquet community does not consider public. We
> rely on these checks to highlight binary incompatibilities, and then we
> discuss them on this list or in the dev sync. If the class is internal, we
> add an exclusion for it.
>
> I know you're familiar with this process since we've talked about it
> before. I also know that you'd rather have more strict binary
> compatibility, but until we have someone with the time to do some
> maintenance and build a public API module, I'm afraid that's what we have
> to work with.
>
> Michael,
>
> I hope the context above is helpful and explains why running a binary
> compatibility check tool will find incompatible changes. We allow binary
> incompatible changes to internal classes and modules, like parquet-common.
>
> On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <gabor@apache.org> wrote:
>
> Ryan,
> I would not trust our compatibility checks (semver) too much. Currently, it
> is configured to compare our current version to 1.7.0. It means anything
> that is added since 1.7.0 and then broken in a later release won't be
> caught. In addition, many packages are excluded from the check because of
> different reasons. For example, org/apache/parquet/schema/** is excluded, so
> if it really were an API compatibility issue we certainly wouldn't
> catch it.
>
> Michael,
> It fails because of a NoSuchMethodError pointing to a method that is newly
> introduced in 1.11. Both the caller and the callee are shipped by parquet-mr.
> So, I'm quite sure it is a classpath issue. It seems that the 1.11 version
> of the parquet-column jar is not on the classpath.
>
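A quick way to confirm the classpath diagnosis above on a suspect JVM is to
print which jar each Parquet class is actually loaded from. This is only a
sketch (the probe class name and how you launch it are up to you), but a 1.10
parquet-column sitting next to a 1.11 parquet-avro shows up immediately:

    public class WhichJar {
      public static void main(String[] args) {
        // One class from each of the two jars involved in the reported error.
        String[] classes = {
            "org.apache.parquet.schema.LogicalTypeAnnotation", // parquet-column, new in 1.11.0
            "org.apache.parquet.avro.AvroSchemaConverter"      // parquet-avro, the caller
        };
        for (String name : classes) {
          try {
            Class<?> c = Class.forName(name);
            // The code source location is the jar the class was loaded from
            // (it may be null for bootstrap classes, but not for these).
            System.out.println(name + " -> "
                + c.getProtectionDomain().getCodeSource().getLocation());
          } catch (ClassNotFoundException e) {
            System.out.println(name + " -> not on the classpath");
          }
        }
      }
    }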
> On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <heuermh@gmail.com> wrote:
>
> The dependency versions are consistent in our artifact
>
> $ mvn dependency:tree | grep parquet
> [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> [INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
>
> The latter error
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task
> 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>
> occurs when I attempt to run via spark-submit on Spark 2.4.4
>
> $ spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
>
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
> Branch
> Compiled by user  on 2019-08-27T21:21:38Z
> Revision
> Url
> Type --help for more information.
>
> On Nov 21, 2019, at 6:06 PM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
>
> Thanks for looking into it, Nandor. That doesn't sound like a problem with
> Parquet, but a problem with the test environment, since parquet-avro
> depends on a newer API method.
>
> On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <nk...@cloudera.com.invalid> wrote:
>
> I'm not sure that this is a binary compatibility issue. The missing builder
> method was recently added in 1.11.0 with the introduction of the new
> logical type API, while the original version of this method (the one with a
> single OriginalType input parameter, called before by AvroSchemaConverter)
> is kept untouched. It seems to me that the Parquet versions on the Spark
> executor mismatch: parquet-avro is on 1.11.0, but parquet-column is still
> on an older version.
>
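Nandor's point is easiest to see in code. The sketch below, which assumes
parquet-column 1.11.0 on the classpath, builds the same schema through both
overloads: the pre-1.11 one taking an OriginalType (kept, but deprecated) and
the new one taking a LogicalTypeAnnotation. parquet-avro 1.11.0 is compiled
against the new overload, so pairing it with an older parquet-column at
runtime produces exactly the NoSuchMethodError quoted in this thread.

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class BuilderOverloads {
      public static void main(String[] args) {
        // Pre-1.11 overload; still present in 1.11.0.
        MessageType viaOriginalType = Types.buildMessage()
            .required(PrimitiveTypeName.BINARY)
            .as(OriginalType.UTF8).named("data")
            .named("record");

        // Overload added in 1.11.0; this is the method the stack trace
        // reports missing when an older parquet-column is picked up.
        MessageType viaAnnotation = Types.buildMessage()
            .required(PrimitiveTypeName.BINARY)
            .as(LogicalTypeAnnotation.stringType()).named("data")
            .named("record");

        System.out.println(viaOriginalType);
        System.out.println(viaAnnotation);
      }
    }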
> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <heuermh@gmail.com> wrote:
>
> Perhaps not strictly necessary to say, but if this particular
> compatibility break between 1.10 and 1.11 was intentional, and no other
> compatibility breaks are found, I would vote -1 (non-binding) on this RC
> such that we might go back and revisit the changes to preserve
> compatibility.
>
> I am not sure there is presently enough motivation in the Spark project
> for a release after 2.4.4 and before 3.0 in which to bump the Parquet
> dependency version to 1.11.x.
>
>    michael
>
> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
>
> Gabor, shouldn't Parquet be binary compatible for public APIs? From the
> stack trace, it looks like this 1.11.0 RC breaks binary compatibility in
> the type builders.
>
> Looks like this should have been caught by the binary compatibility checks.
>
> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <gabor@apache.org> wrote:
>
> Hi Michael,
>
> Unfortunately, I don't have too much experience with Spark. But if Spark
> uses the parquet-mr library in an embedded way (that's how Hive uses it),
> it is required to re-build Spark with the 1.11 RC parquet-mr.
>
> Regards,
> Gabor
>
> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <heuermh@gmail.com> wrote:
>
> It appears a provided scope dependency on spark-sql leaking old parquet
> versions was causing the runtime error below. After including new
> parquet-column and parquet-hadoop compile scope dependencies (in addition
> to parquet-avro, which we already have at compile scope), our build
> succeeds.
>
> https://github.com/bigdatagenomics/adam/pull/2232
>
> However, when running via spark-submit I run into a similar runtime error
>
> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>         at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>         at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>         at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>         at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>         at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>         at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:123)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> Will bumping our library dependency version to 1.11 require a new version
> of Spark, built against Parquet 1.11?
>
> Please accept my apologies if this is heading out-of-scope for the Parquet
> mailing list.
>
>   michael
>
> On Nov 20, 2019, at 10:00 AM, Michael Heuer <heuermh@GMAIL.COM> wrote:
>
> I am willing to do some benchmarking on genomic data at scale but am not
> quite sure what the Spark target version for 1.11.0 might be. Will Parquet
> 1.11.0 be compatible with Spark 2.4.x?
>
> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
>
> ...
> (TID 0, localhost, executor driver): java.lang.NoClassDefFoundError:
> org/apache/parquet/schema/LogicalTypeAnnotation
>         at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>         at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>         at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>         at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>         at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>         at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:123)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> michael
>
> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <gabor@apache.org> wrote:
>
> Thanks, Fokko.
>
> Ryan, we did not do such measurements yet. I'm afraid I won't have enough
> time to do that in the next couple of weeks.
>
> Cheers,
> Gabor
>
> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
>
> Thanks Gabor for the explanation. I'd like to change my vote to +1
> (non-binding).
>
> Cheers, Fokko
>
> On Tue, Nov 19, 2019 at 6:03 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> Gabor, what I meant was: have we tried this with real data to see the
> effect? I think those results would be helpful.
>
> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <gabor@apache.org> wrote:
>
> Hi Ryan,
>
> It is not easy to calculate. For the column indexes feature we introduced
> two new structures saved before the footer: column indexes and offset
> indexes. If the min/max values are not too long, then the truncation might
> not decrease the file size because of the offset indexes. Moreover, we also
> introduced parquet.page.row.count.limit, which might increase the number of
> pages, which leads to increasing the file size.
> The footer itself is also changed and we are saving more values in it: the
> offset values to the column/offset indexes, the new logical type
> structures, the CRC checksums (we might have some others).
> So, the size of files with a small amount of data will be increased
> (because of the larger footer). The size of files where the values can be
> encoded very well (RLE) will probably be increased (because we will have
> more pages). The size of some files where the values are long (>64 bytes by
> default) might be decreased because of truncating the min/max values.
>
> Regards,
> Gabor
>
> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
>
> Gabor, do we have an idea of the additional overhead for a non-test data
> file? It should be easy to validate that this doesn't introduce an
> unreasonable amount of overhead. In some cases, it should actually be
> smaller since the column indexes are truncated and page stats are not.
>
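For anyone who wants to reproduce the size effects Gabor describes, the
relevant knobs can be set per job. In the sketch below,
parquet.page.row.count.limit is named in the thread itself; the checksum and
truncation property names and values are assumptions based on the 1.11.0
feature discussion, so double-check them against ParquetOutputFormat before
relying on them.

    import org.apache.hadoop.conf.Configuration;

    public class SizeExperimentConf {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Named in the thread: caps the rows per page; lower values mean more pages.
        conf.setInt("parquet.page.row.count.limit", 20000);

        // Assumed property names (verify against ParquetOutputFormat in 1.11.0):
        // page-level CRC checksums, reported as enabled by default in this RC,
        conf.setBoolean("parquet.page.write-checksum.enabled", false);
        // and the column index min/max truncation length (">64 bytes by default" above).
        conf.setInt("parquet.columnindex.truncate.length", 64);

        conf.forEach(entry -> {
          if (entry.getKey().startsWith("parquet.")) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
          }
        });
      }
    }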
> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky <ga...@cloudera.com.invalid> wrote:
>
> Hi Fokko,
>
> For the first point: the referenced constructor is private, and Iceberg
> uses it via reflection. It is not a breaking change. I think parquet-mr
> shall not keep private methods only because clients might use them via
> reflection.
>
> About the checksum: I've agreed on having the CRC checksum write enabled by
> default because the benchmarks did not show significant performance
> penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
>
> About the file size change: 1.11.0 is introducing column indexes, CRC
> checksums, removing the statistics from the page headers, and maybe other
> changes that impact file size. If only file size is in question, I cannot
> see a breaking change here.
>
> Regards,
> Gabor
>
> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
>
> Unfortunately, a -1 from my side (non-binding).
>
> I've updated Iceberg to Parquet 1.11.0 and found three things:
>
> - We've broken backward compatibility of the constructor of
>   ColumnChunkPageWriteStore
>   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
>   This required a change
>   <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
>   to the code. This isn't a hard blocker, but if there will be a new RC,
>   I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
> - Related, and something we need to put in the changelog: checksums are
>   enabled by default
>   <https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54>.
>   This will impact performance. I would suggest disabling it by default:
>   https://github.com/apache/parquet-mr/pull/700
>   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> - Binary compatibility. While updating Iceberg, I've noticed that the
>   split-test was failing:
>   <https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199>
>   The two records are now divided over four Spark partitions. Something in
>   the output has changed since the files are bigger now. Does anyone have
>   any idea what's changed, or a way to check this? The only thing I can
>   think of is the checksum mentioned above.
>
> $ ls -lah ~/Desktop/parquet-1-1*
> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>
> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> id = 1
> data = a
>
> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> id = 1
> data = a
>
> A binary diff is here:
> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>
> Cheers, Fokko
>
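Fokko's first point concerns reflective access to a private constructor,
which is exactly the kind of dependency the compatibility tooling cannot see.
A hedged way to inspect what actually changed is to list the declared
constructors on each version's classpath and diff the output; the class name
comes from the discussion, the rest is illustration only.

    import java.lang.reflect.Constructor;

    public class ListCtors {
      public static void main(String[] args) throws ClassNotFoundException {
        // Run once against parquet-hadoop 1.10.1 and once against 1.11.0,
        // then diff the output to see the signature change Iceberg hit.
        Class<?> cls = Class.forName(
            "org.apache.parquet.hadoop.ColumnChunkPageWriteStore");
        for (Constructor<?> ctor : cls.getDeclaredConstructors()) {
          System.out.println(ctor.toGenericString());
        }
      }
    }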
> On Sat, Nov 16, 2019 at 4:18 AM, Junjie Chen <chenjunjiedada@gmail.com> wrote:
>
> +1
> Verified signature, checksum and ran mvn install successfully.
>
> On Thu, Nov 14, 2019 at 2:05 PM, Wang, Yuming <yu...@ebay.com.invalid> wrote:
>
> +1
> Tested Parquet 1.11.0 with the Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
>
> On 2019/11/13, 21:33, "Gabor Szadovszky" <gabor@apache.org> wrote:
>
> Hi everyone,
>
> I propose the following RC to be released as official Apache Parquet 1.11.0
> release.
>
> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> * This corresponds to the tag: apache-parquet-1.11.0-rc7
> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>
> You can find the KEYS file here:
> * https://apache.org/dist/parquet/KEYS
>
> Binary artifacts are staged in Nexus here:
> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>
> This release includes the changes listed at:
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Parquet 1.11.0
> [ ] +0
> [ ] -1 Do not release this because...

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Thanks, Zoltan.

I also vote +1 (binding)

Cheers,
Gabor

On Tue, Nov 26, 2019 at 1:46 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> +1 (binding)
>
> - I have read through the problem reports in this e-mail thread (one caused
> by the use of a private method via reflection and another one caused by
> having mixed versions of the libraries on the classpath) and I am convinced
> that they do not block the release.
> - Signature and hash of the source tarball are valid.
> - The specified git hash matches the specified git tag.
> - The contents of the source tarball match the contents of the git repo at
> the specified tag.
>
> Br,
>
> Zoltan
>
>
> On Tue, Nov 26, 2019 at 10:54 AM Gabor Szadovszky <ga...@apache.org>
> wrote:
>
> > Created https://issues.apache.org/jira/browse/PARQUET-1703 to track
> this.
> >
> > Back to the RC. Anyone from the PMC willing to vote?
> >
> > Cheers,
> > Gabor
> >
> > On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >
> > > Gabor, good point about not being able to check new APIs. Updating the
> > > previous version would also allow us to get rid of temporary
> exclusions,
> > > like the one you pointed out for schema. It would be great to improve
> > what
> > > we catch there.
> > >
> > > On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <ga...@apache.org>
> > wrote:
> > >
> > > > Hi Ryan,
> > > >
> > > > It is a different topic but would like to reflect shortly.
> > > > I understand that 1.7.0 was the first apache release. The problem
> with
> > > > doing the compatibility checks comparing to 1.7.0 is that we can
> easily
> > > add
> > > > incompatibilities in API which are added after 1.7.0. For example:
> > > Adding a
> > > > new class for public use in 1.8.0 then removing it in 1.9.0. The
> > > > compatibility check would not discover this breaking change. So, I
> > > think, a
> > > > better approach would be to always compare to the previous minor
> > release
> > > > (e.g. comparing 1.9.0 to 1.8.0 etc.).
> > > > As I've written before, even org/apache/parquet/schema/** is excluded
> > > from
> > > > the compatibility check. As far as I know this is public API. So, I
> am
> > > not
> > > > sure that only packages that are not part of the public API are
> > excluded.
> > > >
> > > > Let's discuss this on the next parquet sync.
> > > >
> > > > Regards,
> > > > Gabor
> > > >
> > > > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rblue@netflix.com.invalid
> >
> > > > wrote:
> > > >
> > > > > Gabor,
> > > > >
> > > > > 1.7.0 was the first version using the org.apache.parquet packages,
> so
> > > > > that's the correct base version for compatibility checks. The
> > > exclusions
> > > > in
> > > > > the POM are classes that the Parquet community does not consider
> > > public.
> > > > We
> > > > > rely on these checks to highlight binary incompatibilities, and
> then
> > we
> > > > > discuss them on this list or in the dev sync. If the class is
> > internal,
> > > > we
> > > > > add an exclusion for it.
> > > > >
> > > > > I know you're familiar with this process since we've talked about
> it
> > > > > before. I also know that you'd rather have more strict binary
> > > > > compatibility, but until we have someone with the time to do some
> > > > > maintenance and build a public API module, I'm afraid that's what
> we
> > > have
> > > > > to work with.
> > > > >
> > > > > Michael,
> > > > >
> > > > > I hope the context above is helpful and explains why running a
> binary
> > > > > compatibility check tool will find incompatible changes. We allow
> > > binary
> > > > > incompatible changes to internal classes and modules, like
> > > > parquet-common.
> > > > >
> > > > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <
> gabor@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Ryan,
> > > > > > I would not trust our compatibility checks (semver) too much.
> > > > Currently,
> > > > > it
> > > > > > is configured to compare our current version to 1.7.0. It means
> > > > anything
> > > > > > that is added since 1.7.0 and then broke in a later release won't
> > be
> > > > > > caught. In addition, many packages are excluded from the check
> > > because
> > > > of
> > > > > > different reasons. For example org/apache/parquet/schema/** is
> > > excluded
> > > > > so
> > > > > > if it would really be an API compatibility issue we certainly
> > > wouldn't
> > > > > > catch it.
> > > > > >
> > > > > > Michael,
> > > > > > It fails because of a NoSuchMethodError pointing to a method that
> > is
> > > > > newly
> > > > > > introduced in 1.11. Both the caller and the callee shipped by
> > > > parquet-mr.
> > > > > > So, I'm quite sure it is a classpath issue. It seems that the
> 1.11
> > > > > version
> > > > > > of the parquet-column jar is not on the classpath.
> > > > > >
> > > > > >
> > > > > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <heuermh@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > The dependency versions are consistent in our artifact
> > > > > > >
> > > > > > > $ mvn dependency:tree | grep parquet
> > > > > > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > > > > [INFO] |     \-
> > > > > > > org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > > > > [INFO] |  +-
> org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > > > > [INFO] |  |  +-
> > > org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > > > > [INFO] |  |  \-
> > > > org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > > > > [INFO] |  +-
> org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > > > > [INFO] |  |  +-
> > > org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > > > > >
> > > > > > > The latter error
> > > > > > >
> > > > > > > Caused by: org.apache.spark.SparkException: Job aborted due to
> > > stage
> > > > > > > failure: Task 0 in stage 0.0 failed 1 times, most recent
> failure:
> > > > Lost
> > > > > > task
> > > > > > > 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > > > > java.lang.NoSuchMethodError:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > >         at
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > >
> > > > > > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > > > > > >
> > > > > > > $ spark-submit --version
> > > > > > > Welcome to
> > > > > > >       ____              __
> > > > > > >      / __/__  ___ _____/ /__
> > > > > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > > > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > > > > >       /_/
> > > > > > >
> > > > > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM,
> > > > > 1.8.0_191
> > > > > > > Branch
> > > > > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > > > > Revision
> > > > > > > Url
> > > > > > > Type --help for more information.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue
> > <rblue@netflix.com.INVALID
> > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Thanks for looking into it, Nandor. That doesn't sound like a
> > > > problem
> > > > > > > with
> > > > > > > > Parquet, but a problem with the test environment since
> > > parquet-avro
> > > > > > > depends
> > > > > > > > on a newer API method.
> > > > > > > >
> > > > > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
> > > > > > > <nk...@cloudera.com.invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> I'm not sure that this is a binary compatibility issue. The
> > > > missing
> > > > > > > builder
> > > > > > > >> method was recently added in 1.11.0 with the introduction of
> > the
> > > > new
> > > > > > > >> logical type API, while the original version (one with a
> > single
> > > > > > > >> OriginalType input parameter called before by
> > > AvroSchemaConverter)
> > > > > of
> > > > > > > this
> > > > > > > >> method is kept untouched. It seems to me that the Parquet
> > > version
> > > > on
> > > > > > > Spark
> > > > > > > >> executor mismatch: parquet-avro is on 1.11.0, but
> > parquet-column
> > > > is
> > > > > > > still
> > > > > > > >> on an older version.
> > > > > > > >>
> > > > > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <
> > > heuermh@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >>> Perhaps not strictly necessary to say, but if this
> particular
> > > > > > > >>> compatibility break between 1.10 and 1.11 was intentional,
> > and
> > > no
> > > > > > other
> > > > > > > >>> compatibility breaks are found, I would vote -1
> (non-binding)
> > > on
> > > > > this
> > > > > > > RC
> > > > > > > >>> such that we might go back and revisit the changes to
> > preserve
> > > > > > > >>> compatibility.
> > > > > > > >>>
> > > > > > > >>> I am not sure there is presently enough motivation in the
> > Spark
> > > > > > project
> > > > > > > >>> for a release after 2.4.4 and before 3.0 in which to bump
> the
> > > > > Parquet
> > > > > > > >>> dependency version to 1.11.x.
> > > > > > > >>>
> > > > > > > >>>   michael
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue
> > > > <rblue@netflix.com.INVALID
> > > > > >
> > > > > > > >>> wrote:
> > > > > > > >>>>
> > > > > > > >>>> Gabor, shouldn't Parquet be binary compatible for public
> > APIs?
> > > > > From
> > > > > > > the
> > > > > > > >>>> stack trace, it looks like this 1.11.0 RC breaks binary
> > > > > > compatibility
> > > > > > > >> in
> > > > > > > >>>> the type builders.
> > > > > > > >>>>
> > > > > > > >>>> Looks like this should have been caught by the binary
> > > > > compatibility
> > > > > > > >>> checks.
> > > > > > > >>>>
> > > > > > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <
> > > > > gabor@apache.org>
> > > > > > > >>> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Hi Michael,
> > > > > > > >>>>>
> > > > > > > >>>>> Unfortunately, I don't have too much experience on Spark.
> > But
> > > > if
> > > > > > > spark
> > > > > > > >>> uses
> > > > > > > >>>>> the parquet-mr library in an embedded way (that's how
> Hive
> > > uses
> > > > > it)
> > > > > > > it
> > > > > > > >>> is
> > > > > > > >>>>> required to re-build Spark with 1.11 RC parquet-mr.
> > > > > > > >>>>>
> > > > > > > >>>>> Regards,
> > > > > > > >>>>> Gabor
> > > > > > > >>>>>
> > > > > > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <
> > > > heuermh@gmail.com
> > > > > >
> > > > > > > >>> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> It appears a provided scope dependency on spark-sql
> leaks
> > > old
> > > > > > > parquet
> > > > > > > >>>>>> versions was causing the runtime error below.  After
> > > including
> > > > > new
> > > > > > > >>>>>> parquet-column and parquet-hadoop compile scope
> > dependencies
> > > > (in
> > > > > > > >>> addition
> > > > > > > >>>>>> to parquet-avro, which we already have at compile
> scope),
> > > our
> > > > > > build
> > > > > > > >>>>>> succeeds.
> > > > > > > >>>>>>
> > > > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232 <
> > > > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232>
> > > > > > > >>>>>>
> > > > > > > >>>>>> However, when running via spark-submit I run into a
> > similar
> > > > > > runtime
> > > > > > > >>> error
> > > > > > > >>>>>>
> > > > > > > >>>>>> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > > > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > > > > > >>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > >>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > >>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > >>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > >>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > > > > > >>>>>>
> > > > > > > >>>>>>
> > > > > > > >>>>>> Will bumping our library dependency version to 1.11 require a new
> > > > > > > >>>>>> version of Spark, built against Parquet 1.11?
> > > > > > > >>>>>>
> > > > > > > >>>>>> Please accept my apologies if this is heading out-of-scope for the
> > > > > > > >>>>>> Parquet mailing list.
> > > > > > > >>>>>>
> > > > > > > >>>>>>  michael
> > > > > > > >>>>>>
> > > > > > > >>>>>>
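A quick way to confirm which jars actually win at runtime (a minimal,
illustrative sketch, not part of parquet-mr or ADAM): resolve the classes
named in the stack trace above and print their source URLs. Running it both
on the driver and inside an executor is useful, since the two classpaths can
differ under spark-submit.

import java.net.URL;

public class ParquetClasspathCheck {
    public static void main(String[] args) {
        String[] names = {
            "org.apache.parquet.schema.LogicalTypeAnnotation", // new in 1.11.0
            "org.apache.parquet.schema.Types",
            "org.apache.parquet.avro.AvroSchemaConverter"
        };
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (String name : names) {
            URL url = loader.getResource(name.replace('.', '/') + ".class");
            // null means the class is missing entirely; otherwise the URL
            // shows the winning jar, e.g. .../parquet-column-1.10.1.jar!/...
            System.out.println(name + " -> " + url);
        }
    }
}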
> > > > > > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <heuermh@GMAIL.COM> wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I am willing to do some benchmarking on genomic data at scale but am
> > > > > > > >>>>>>> not quite sure what the Spark target version for 1.11.0 might be.
> > > > > > > >>>>>>> Will Parquet 1.11.0 be compatible with Spark 2.4.x?
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> …
> > > > > > > >>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
> > > > > > > >>>>>>> org/apache/parquet/schema/LogicalTypeAnnotation
> > > > > > > >>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > > > >>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > > > >>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > > > >>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > > > >>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > > > > > >>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> > > > > > > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > > > > > >>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > > > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> michael
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Thanks, Fokko.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Ryan, we did not do such measurements yet. I'm afraid I won't have
> > > > > > > >>>>>>>> enough time to do that in the next couple of weeks.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Cheers,
> > > > > > > >>>>>>>> Gabor
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>> Thanks Gabor for the explanation. I'd like to change my vote to +1
> > > > > > > >>>>>>>>> (non-binding).
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Cheers, Fokko
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> On Tue, 19 Nov 2019 at 18:03, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>> Gabor, what I meant was: have we tried this with real data to see
> > > > > > > >>>>>>>>>> the effect? I think those results would be helpful.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <gabor@apache.org> wrote:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>>> Hi Ryan,
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> It is not easy to calculate. For the column indexes feature we
> > > > > > > >>>>>>>>>>> introduced two new structures saved before the footer: column
> > > > > > > >>>>>>>>>>> indexes and offset indexes. If the min/max values are not too long,
> > > > > > > >>>>>>>>>>> then the truncation might not decrease the file size because of the
> > > > > > > >>>>>>>>>>> offset indexes. Moreover, we also introduced
> > > > > > > >>>>>>>>>>> parquet.page.row.count.limit, which might increase the number of
> > > > > > > >>>>>>>>>>> pages, which leads to increasing the file size.
> > > > > > > >>>>>>>>>>> The footer itself is also changed and we are saving more values in
> > > > > > > >>>>>>>>>>> it: the offset values to the column/offset indexes, the new logical
> > > > > > > >>>>>>>>>>> type structures, the CRC checksums (we might have some others).
> > > > > > > >>>>>>>>>>> So, the size of files with a small amount of data will be increased
> > > > > > > >>>>>>>>>>> (because of the larger footer). The size of files where the values
> > > > > > > >>>>>>>>>>> can be encoded very well (RLE) will probably be increased (because
> > > > > > > >>>>>>>>>>> we will have more pages). The size of some files where the values
> > > > > > > >>>>>>>>>>> are long (>64 bytes by default) might be decreased because of
> > > > > > > >>>>>>>>>>> truncating the min/max values.
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Regards,
> > > > > > > >>>>>>>>>>> Gabor
> > > > > > > >>>>>>>>>>>
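The sizes Gabor describes are driven by ordinary writer properties, so the
trade-offs are tunable per job. A minimal sketch (key names and defaults as
documented for parquet-mr 1.11.0; worth re-checking against the release you
actually run):

import org.apache.hadoop.conf.Configuration;

public class FileSizeKnobs {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Upper bound on rows per page; the lower the limit, the more pages
        // (and page headers) a file may contain.
        conf.setInt("parquet.page.row.count.limit", 20000);
        // Column index min/max values longer than this many bytes are
        // truncated, which is what can shrink files with long values.
        conf.setInt("parquet.columnindex.truncate.length", 64);
        return conf;
    }
}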
> > > > > > > >>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Gabor, do we have an idea of the additional overhead for a non-test
> > > > > > > >>>>>>>>>>>> data file? It should be easy to validate that this doesn't
> > > > > > > >>>>>>>>>>>> introduce an unreasonable amount of overhead. In some cases, it
> > > > > > > >>>>>>>>>>>> should actually be smaller since the column indexes are truncated
> > > > > > > >>>>>>>>>>>> and page stats are not.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > > > > > > >>>>>>>>>>>> <ga...@cloudera.com.invalid> wrote:
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Hi Fokko,
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> For the first point: the referenced constructor is private and
> > > > > > > >>>>>>>>>>>>> Iceberg uses it via reflection. It is not a breaking change. I
> > > > > > > >>>>>>>>>>>>> think parquet-mr shall not keep private methods only because
> > > > > > > >>>>>>>>>>>>> clients might use them via reflection.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> About the checksum: I've agreed on having the CRC checksum write
> > > > > > > >>>>>>>>>>>>> enabled by default because the benchmarks did not show significant
> > > > > > > >>>>>>>>>>>>> performance penalties. See https://github.com/apache/parquet-mr/pull/647
> > > > > > > >>>>>>>>>>>>> for details.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> About the file size change: 1.11.0 is introducing column indexes
> > > > > > > >>>>>>>>>>>>> and CRC checksums, removing the statistics from the page headers,
> > > > > > > >>>>>>>>>>>>> and maybe other changes that impact file size. If only file size
> > > > > > > >>>>>>>>>>>>> is in question, I cannot see a breaking change here.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Regards,
> > > > > > > >>>>>>>>>>>>> Gabor
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>
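For anyone who wants the pre-1.11 page layout back, the checksum write is a
per-writer setting. A minimal sketch against the parquet-avro writer builder
(method and key names as of 1.11.0, worth verifying; the output path is a
placeholder):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ChecksumToggle {
    public static ParquetWriter<GenericRecord> open(Schema schema) throws Exception {
        return AvroParquetWriter.<GenericRecord>builder(new Path("file:/tmp/out.parquet"))
            .withSchema(schema)
            // Page-level CRC checksums are on by default in 1.11.0.
            .withPageWriteChecksumEnabled(false)
            .build();
    }
}

The same switch should be reachable from Hadoop jobs via the configuration
key parquet.page.write-checksum.enabled.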
> > > > > > > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> - We've broken backward compatibility of the constructor of
> > > > > > > >>>>>>>>>>>>>> ColumnChunkPageWriteStore
> > > > > > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> > > > > > > >>>>>>>>>>>>>> This required a change
> > > > > > > >>>>>>>>>>>>>> <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> > > > > > > >>>>>>>>>>>>>> to the code (see the sketch after this message). This isn't a hard
> > > > > > > >>>>>>>>>>>>>> blocker, but if there will be a new RC, I've submitted a patch:
> > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
> > > > > > > >>>>>>>>>>>>>> - Related, and something we need to put in the changelog, is that
> > > > > > > >>>>>>>>>>>>>> checksums are enabled by default:
> > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > > > > >>>>>>>>>>>>>> This will impact performance. I would suggest disabling it by default:
> > > > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
> > > > > > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> > > > > > > >>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed that
> > > > > > > >>>>>>>>>>>>>> the split-test was failing:
> > > > > > > >>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > > > >>>>>>>>>>>>>> The two records are now divided over four Spark partitions.
> > > > > > > >>>>>>>>>>>>>> Something in the output has changed, since the files are bigger
> > > > > > > >>>>>>>>>>>>>> now. Has anyone any idea what's changed, or a way to check this?
> > > > > > > >>>>>>>>>>>>>> The only thing I can think of is the checksum mentioned above.
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > > > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09 /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05 /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > > >>>>>>>>>>>>>> id = 1
> > > > > > > >>>>>>>>>>>>>> data = a
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > > >>>>>>>>>>>>>> id = 1
> > > > > > > >>>>>>>>>>>>>> data = a
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> A binary diff here:
> > > > > > > >>>>>>>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Cheers, Fokko
> > > > > > > >>>>>>>>>>>>>>
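The sketch referenced in Fokko's first point: reflective access to a private
constructor pins the caller to one exact signature, so a signature change in
a new release only shows up at runtime. Illustrative code, not the actual
Iceberg implementation:

import java.lang.reflect.Constructor;

public class PrivateCtorPinning {
    // Look up a private constructor with exactly the given parameter types;
    // this lookup is the step that breaks when the signature changes
    // between Parquet releases.
    static Constructor<?> pinned(Class<?> target, Class<?>... paramTypes) {
        try {
            Constructor<?> ctor = target.getDeclaredConstructor(paramTypes);
            ctor.setAccessible(true); // private API: no compatibility promise
            return ctor;
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(
                target.getName() + " constructor signature changed", e);
        }
    }
}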
> > > > > > > >>>>>>>>>>>>>> On Sat, 16 Nov 2019 at 04:18, Junjie Chen <chenjunjiedada@gmail.com> wrote:
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> +1
> > > > > > > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> wrote on Thu, 14 Nov 2019 at 2:05 PM:
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> +1
> > > > > > > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> --
> > > > > > > >>>>>>>>>>>> Ryan Blue
> > > > > > > >>>>>>>>>>>> Software Engineer
> > > > > > > >>>>>>>>>>>> Netflix

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
+1 (binding)

- I have read through the problem reports in this e-mail thread (one caused
by the use of a private method via reflection and another one caused by
having mixed versions of the libraries on the classpath) and I am convinced
that they do not block the release.
- Signature and hash of the source tarball are valid.
- The specified git hash matches the specified git tag.
- The contents of the source tarball match the contents of the git repo at
the specified tag.

Br,

Zoltan
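
A minimal sketch of the hash check in the steps above, assuming the usual
.sha512 file published next to the tarball (the file name is a placeholder):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class VerifyTarballChecksum {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get("apache-parquet-1.11.0.tar.gz"));
        byte[] digest = MessageDigest.getInstance("SHA-512").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        // Compare against the contents of the published .sha512 file.
        System.out.println(hex);
    }
}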


On Tue, Nov 26, 2019 at 10:54 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Created https://issues.apache.org/jira/browse/PARQUET-1703 to track this.
>
> Back to the RC. Anyone from the PMC willing to vote?
>
> Cheers,
> Gabor
>
> On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > Gabor, good point about not being able to check new APIs. Updating the
> > previous version would also allow us to get rid of temporary exclusions,
> > like the one you pointed out for schema. It would be great to improve
> > what we catch there.
> >
> > On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <ga...@apache.org> wrote:
> >
> > > Hi Ryan,
> > >
> > > It is a different topic but I would like to reflect shortly.
> > > I understand that 1.7.0 was the first Apache release. The problem with
> > > doing the compatibility checks against 1.7.0 is that we can easily add
> > > incompatibilities in APIs which were added after 1.7.0. For example:
> > > adding a new class for public use in 1.8.0, then removing it in 1.9.0.
> > > The compatibility check would not discover this breaking change. So, I
> > > think a better approach would be to always compare to the previous
> > > minor release (e.g. comparing 1.9.0 to 1.8.0, etc.).
> > > As I've written before, even org/apache/parquet/schema/** is excluded
> > > from the compatibility check. As far as I know this is public API. So,
> > > I am not sure that only packages that are not part of the public API
> > > are excluded.
> > >
> > > Let's discuss this on the next parquet sync.
> > >
> > > Regards,
> > > Gabor
> > >
> > > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rb...@netflix.com.invalid>
> > > wrote:
> > >
> > > > Gabor,
> > > >
> > > > 1.7.0 was the first version using the org.apache.parquet packages, so
> > > > that's the correct base version for compatibility checks. The
> > > > exclusions in the POM are classes that the Parquet community does not
> > > > consider public. We rely on these checks to highlight binary
> > > > incompatibilities, and then we discuss them on this list or in the dev
> > > > sync. If the class is internal, we add an exclusion for it.
> > > >
> > > > I know you're familiar with this process since we've talked about it
> > > > before. I also know that you'd rather have more strict binary
> > > > compatibility, but until we have someone with the time to do some
> > > > maintenance and build a public API module, I'm afraid that's what we
> > > > have to work with.
> > > >
> > > > Michael,
> > > >
> > > > I hope the context above is helpful and explains why running a binary
> > > > compatibility check tool will find incompatible changes. We allow
> > > > binary incompatible changes to internal classes and modules, like
> > > > parquet-common.
> > > >
> > > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <ga...@apache.org> wrote:
> > > >
> > > > > Ryan,
> > > > > I would not trust our compatibility checks (semver) too much.
> > > > > Currently, it is configured to compare our current version to 1.7.0.
> > > > > It means anything that is added since 1.7.0 and then broke in a later
> > > > > release won't be caught. In addition, many packages are excluded from
> > > > > the check for different reasons. For example, org/apache/parquet/schema/**
> > > > > is excluded, so if it really were an API compatibility issue we
> > > > > certainly wouldn't catch it.
> > > > >
> > > > > Michael,
> > > > > It fails because of a NoSuchMethodError pointing to a method that is
> > > > > newly introduced in 1.11. Both the caller and the callee are shipped
> > > > > by parquet-mr. So, I'm quite sure it is a classpath issue. It seems
> > > > > that the 1.11 version of the parquet-column jar is not on the classpath.
> > > > >
> > > > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <he...@gmail.com> wrote:
> > > wrote:
> > > > >
> > > > > > The dependency versions are consistent in our artifact
> > > > > >
> > > > > > $ mvn dependency:tree | grep parquet
> > > > > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > > > [INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > > > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > > > >
> > > > > > The latter error
> > > > > >
> > > > > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > > > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
> > > > > > Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > > > java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > >
> > > > > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > > > > >
> > > > > > $ spark-submit --version
> > > > > > Welcome to
> > > > > >       ____              __
> > > > > >      / __/__  ___ _____/ /__
> > > > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > > > >       /_/
> > > > > >
> > > > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM,
> > > > 1.8.0_191
> > > > > > Branch
> > > > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > > > Revision
> > > > > > Url
> > > > > > Type --help for more information.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> > > > > > >
> > > > > > > Thanks for looking into it, Nandor. That doesn't sound like a problem
> > > > > > > with Parquet, but a problem with the test environment, since
> > > > > > > parquet-avro depends on a newer API method.
> > > > > > >
> > > > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <nk...@cloudera.com.invalid> wrote:
> > > > > > >
> > > > > > >> I'm not sure that this is a binary compatibility issue. The missing
> > > > > > >> builder method was recently added in 1.11.0 with the introduction of
> > > > > > >> the new logical type API, while the original version of this method
> > > > > > >> (the one with a single OriginalType input parameter, called before by
> > > > > > >> AvroSchemaConverter) is kept untouched. It seems to me that the
> > > > > > >> Parquet versions on the Spark executor mismatch: parquet-avro is on
> > > > > > >> 1.11.0, but parquet-column is still on an older version.
> > > > > > >>
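To make the mismatch concrete, here are the two builder overloads side by
side (a minimal sketch against parquet-column 1.11.0; the schema being built
is arbitrary):

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class BuilderOverloads {
    public static void main(String[] args) {
        // Pre-1.11 overload, kept untouched in 1.11.0:
        Type viaOriginalType =
            Types.primitive(PrimitiveTypeName.BINARY, Type.Repetition.REQUIRED)
                 .as(OriginalType.UTF8)
                 .named("name");

        // Overload added in 1.11.0, which parquet-avro 1.11.0 now calls.
        // Resolving this against an older parquet-column jar is exactly
        // what throws the NoSuchMethodError in the reports above.
        Type viaAnnotation =
            Types.primitive(PrimitiveTypeName.BINARY, Type.Repetition.REQUIRED)
                 .as(LogicalTypeAnnotation.stringType())
                 .named("name");

        System.out.println(viaOriginalType);
        System.out.println(viaAnnotation);
    }
}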
> > > > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <heuermh@gmail.com> wrote:
> > > > > > >>
> > > > > > >>> Perhaps not strictly necessary to say, but if this particular
> > > > > > >>> compatibility break between 1.10 and 1.11 was intentional, and no
> > > > > > >>> other compatibility breaks are found, I would vote -1 (non-binding)
> > > > > > >>> on this RC such that we might go back and revisit the changes to
> > > > > > >>> preserve compatibility.
> > > > > > >>>
> > > > > > >>> I am not sure there is presently enough motivation in the Spark
> > > > > > >>> project for a release after 2.4.4 and before 3.0 in which to bump
> > > > > > >>> the Parquet dependency version to 1.11.x.
> > > > > > >>>
> > > > > > >>>   michael
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
> > > > > > >>>>
> > > > > > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs? From
> > > > > > >>>> the stack trace, it looks like this 1.11.0 RC breaks binary
> > > > > > >>>> compatibility in the type builders.
> > > > > > >>>>
> > > > > > >>>> Looks like this should have been caught by the binary compatibility
> > > > > > >>>> checks.
> > > > > > >>>>>>>>>>>>>>> -Phadoop-3.2
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <
> > > > > > >> gabor@apache.org>
> > > > > > >>>>>>>>>>>> wrote:
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> Hi everyone,
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> I propose the following RC to be released as
> > > official
> > > > > > >>>>>>>>>> Apache
> > > > > > >>>>>>>>>>>>>> Parquet
> > > > > > >>>>>>>>>>>>>>> 1.11.0
> > > > > > >>>>>>>>>>>>>>>> release.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> The commit id is
> > > > > 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > >>>>>>>>>>>>>>>> * This corresponds to the tag:
> > > > apache-parquet-1.11.0-rc7
> > > > > > >>>>>>>>>>>>>>>> *
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&amp;reserved=0
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> The release tarball, signature, and checksums
> are
> > > > here:
> > > > > > >>>>>>>>>>>>>>>> *
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&amp;reserved=0
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> You can find the KEYS file here:
> > > > > > >>>>>>>>>>>>>>>> *
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&amp;reserved=0
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
> > > > > > >>>>>>>>>>>>>>>> *
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> This release includes the changes listed at:
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> Please download, verify, and test.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> Please vote in the next 72 hours.
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > > >>>>>>>>>>>>>>>> [ ] +0
> > > > > > >>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>> --
> > > > > > >>>>>>>>>>>> Ryan Blue
> > > > > > >>>>>>>>>>>> Software Engineer
> > > > > > >>>>>>>>>>>> Netflix
> > > > > > >>>>>>>>>>>>
> > > > > > >>>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>> --
> > > > > > >>>>>>>>>> Ryan Blue
> > > > > > >>>>>>>>>> Software Engineer
> > > > > > >>>>>>>>>> Netflix
> > > > > > >>>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> --
> > > > > > >>>> Ryan Blue
> > > > > > >>>> Software Engineer
> > > > > > >>>> Netflix
> > > > > > >>>
> > > > > > >>>
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Blue
> > > > > > > Software Engineer
> > > > > > > Netflix
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Created https://issues.apache.org/jira/browse/PARQUET-1703 to track this.

Back to the RC. Anyone from the PMC willing to vote?

Cheers,
Gabor

On Mon, Nov 25, 2019 at 6:45 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Gabor, good point about not being able to check new APIs. Updating the
> comparison baseline to the previous minor release would also allow us to get
> rid of temporary exclusions, like the one you pointed out for schema. It
> would be great to improve what we catch there.
>
> On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
> > Hi Ryan,
> >
> > It is a different topic, but I would like to reflect on it briefly.
> > I understand that 1.7.0 was the first Apache release. The problem with
> > doing the compatibility checks against 1.7.0 is that we can easily add
> > incompatibilities in APIs that were added after 1.7.0. For example: adding
> > a new class for public use in 1.8.0 and then removing it in 1.9.0. The
> > compatibility check would not discover this breaking change. So, I think,
> > a better approach would be to always compare to the previous minor release
> > (e.g. comparing 1.9.0 to 1.8.0 etc.).
> > As I've written before, even org/apache/parquet/schema/** is excluded from
> > the compatibility check. As far as I know this is public API. So, I am not
> > sure that only packages that are not part of the public API are excluded.
> >
> > Let's discuss this on the next parquet sync.
> >
> > Regards,
> > Gabor
> >
> > On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >
> > > Gabor,
> > >
> > > 1.7.0 was the first version using the org.apache.parquet packages, so
> > > that's the correct base version for compatibility checks. The
> exclusions
> > in
> > > the POM are classes that the Parquet community does not consider
> public.
> > We
> > > rely on these checks to highlight binary incompatibilities, and then we
> > > discuss them on this list or in the dev sync. If the class is internal,
> > we
> > > add an exclusion for it.
> > >
> > > I know you're familiar with this process since we've talked about it
> > > before. I also know that you'd rather have more strict binary
> > > compatibility, but until we have someone with the time to do some
> > > maintenance and build a public API module, I'm afraid that's what we
> have
> > > to work with.
> > >
> > > Michael,
> > >
> > > I hope the context above is helpful and explains why running a binary
> > > compatibility check tool will find incompatible changes. We allow
> binary
> > > incompatible changes to internal classes and modules, like
> > parquet-common.
> > >
> > > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <ga...@apache.org>
> > > wrote:
> > >
> > > > Ryan,
> > > > I would not trust our compatibility checks (semver) too much.
> > > > Currently, it is configured to compare our current version to 1.7.0.
> > > > It means that anything added since 1.7.0 and then broken in a later
> > > > release won't be caught. In addition, many packages are excluded from
> > > > the check for different reasons. For example,
> > > > org/apache/parquet/schema/** is excluded, so if this really were an
> > > > API compatibility issue we certainly wouldn't catch it.
> > > >
> > > > Michael,
> > > > It fails because of a NoSuchMethodError pointing to a method that is
> > > > newly introduced in 1.11. Both the caller and the callee are shipped
> > > > by parquet-mr. So, I'm quite sure it is a classpath issue. It seems
> > > > that the 1.11 version of the parquet-column jar is not on the
> > > > classpath.
> > > >
> > > >
> > > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <he...@gmail.com>
> > wrote:
> > > >
> > > > > The dependency versions are consistent in our artifact
> > > > >
> > > > > $ mvn dependency:tree | grep parquet
> > > > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > > [INFO] |     \-
> > > > > org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > > [INFO] |  |  +-
> org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > > [INFO] |  |  \-
> > org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > > [INFO] |  |  +-
> org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > > >
> > > > > The latter error
> > > > >
> > > > > Caused by: org.apache.spark.SparkException: Job aborted due to
> stage
> > > > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
> > Lost
> > > > task
> > > > > 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > > java.lang.NoSuchMethodError:
> > > > >
> > > >
> > >
> >
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > >         at
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > >
> > > > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > > > >
> > > > > $ spark-submit --version
> > > > > Welcome to
> > > > >       ____              __
> > > > >      / __/__  ___ _____/ /__
> > > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > > >       /_/
> > > > >
> > > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM,
> > > 1.8.0_191
> > > > > Branch
> > > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > > Revision
> > > > > Url
> > > > > Type --help for more information.
> > > > >
> > > > >
> > > > >
> > > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rblue@netflix.com.INVALID
> >
> > > > > wrote:
> > > > > >
> > > > > > Thanks for looking into it, Nandor. That doesn't sound like a
> > problem
> > > > > with
> > > > > > Parquet, but a problem with the test environment since
> parquet-avro
> > > > > depends
> > > > > > on a newer API method.
> > > > > >
> > > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
> > > > > <nk...@cloudera.com.invalid>
> > > > > > wrote:
> > > > > >
> > > > > >> I'm not sure that this is a binary compatibility issue. The missing
> > > > > >> builder method was recently added in 1.11.0 with the introduction of
> > > > > >> the new logical type API, while the original version of this method
> > > > > >> (the one with a single OriginalType input parameter, called before by
> > > > > >> AvroSchemaConverter) is kept untouched. It seems to me that the
> > > > > >> Parquet versions on the Spark executor are mismatched: parquet-avro
> > > > > >> is on 1.11.0, but parquet-column is still on an older version.
> > > > > >>
> > > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <
> heuermh@gmail.com
> > >
> > > > > wrote:
> > > > > >>
> > > > > >>> Perhaps not strictly necessary to say, but if this particular
> > > > > >>> compatibility break between 1.10 and 1.11 was intentional, and
> no
> > > > other
> > > > > >>> compatibility breaks are found, I would vote -1 (non-binding)
> on
> > > this
> > > > > RC
> > > > > >>> such that we might go back and revisit the changes to preserve
> > > > > >>> compatibility.
> > > > > >>>
> > > > > >>> I am not sure there is presently enough motivation in the Spark
> > > > project
> > > > > >>> for a release after 2.4.4 and before 3.0 in which to bump the
> > > Parquet
> > > > > >>> dependency version to 1.11.x.
> > > > > >>>
> > > > > >>>   michael
> > > > > >>>
> > > > > >>>
> > > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue
> > <rblue@netflix.com.INVALID
> > > >
> > > > > >>> wrote:
> > > > > >>>>
> > > > > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs?
> > > From
> > > > > the
> > > > > >>>> stack trace, it looks like this 1.11.0 RC breaks binary
> > > > compatibility
> > > > > >> in
> > > > > >>>> the type builders.
> > > > > >>>>
> > > > > >>>> Looks like this should have been caught by the binary
> > > compatibility
> > > > > >>> checks.
> > > > > >>>>
> > > > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <
> > > gabor@apache.org>
> > > > > >>> wrote:
> > > > > >>>>
> > > > > >>>>> Hi Michael,
> > > > > >>>>>
> > > > > >>>>> Unfortunately, I don't have too much experience with Spark. But
> > > > > >>>>> if Spark uses the parquet-mr library in an embedded way (that's
> > > > > >>>>> how Hive uses it), it is required to re-build Spark with the
> > > > > >>>>> 1.11 RC parquet-mr.
> > > > > >>>>>
> > > > > >>>>> Regards,
> > > > > >>>>> Gabor
> > > > > >>>>>
> > > > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <
> > heuermh@gmail.com
> > > >
> > > > > >>> wrote:
> > > > > >>>>>
> > > > > >>>>>> It appears a provided-scope dependency on spark-sql that leaks
> > > > > >>>>>> old Parquet versions was causing the runtime error below. After
> > > > > >>>>>> including new parquet-column and parquet-hadoop compile-scope
> > > > > >>>>>> dependencies (in addition to parquet-avro, which we already have
> > > > > >>>>>> at compile scope), our build succeeds.
> > > > > >>>>>>
> > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232 <
> > > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232>
> > > > > >>>>>>
> > > > > >>>>>> However, when running via spark-submit I run into a similar
> > > > runtime
> > > > > >>> error
> > > > > >>>>>>
> > > > > >>>>>> Caused by: java.lang.NoSuchMethodError:
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>
> > > > >
> > >
> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > >>>>>>       at org.apache.spark.internal.io
> > > > > >>>>>>
> > > > > >>
> > > >
> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > >>>>>>       at org.apache.spark.internal.io
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > >>>>>>       at org.apache.spark.internal.io
> > > > > >>>>>>
> > .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > >>>>>>       at org.apache.spark.internal.io
> > > > > >>>>>>
> > .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > >
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > >>>>>>       at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> Will bumping our library dependency version to 1.11 require
> a
> > > new
> > > > > >>> version
> > > > > >>>>>> of Spark, built against Parquet 1.11?
> > > > > >>>>>>
> > > > > >>>>>> Please accept my apologies if this is heading out-of-scope
> for
> > > the
> > > > > >>>>> Parquet
> > > > > >>>>>> mailing list.
> > > > > >>>>>>
> > > > > >>>>>>  michael
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <
> > heuermh@GMAIL.COM
> > > >
> > > > > >>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> I am willing to do some benchmarking on genomic data at
> scale
> > > but
> > > > > am
> > > > > >>>>> not
> > > > > >>>>>> quite sure what the Spark target version for 1.11.0 might
> be.
> > > > Will
> > > > > >>>>> Parquet
> > > > > >>>>>> 1.11.0 be compatible in Spark 2.4.x?
> > > > > >>>>>>>
> > > > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our
> build
> > > > > >>>>>>>
> > > > > >>>>>>> …
> > > > > >>>>>>> D 0, localhost, executor driver):
> > > java.lang.NoClassDefFoundError:
> > > > > >>>>>> org/apache/parquet/schema/LogicalTypeAnnotation
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > > > >>>
> > > > >
> > >
> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > > >>>>>>>     at org.apache.spark.internal.io
> > > > > >>>>>>
> > > > > >>
> > > >
> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > > >>>>>>>     at org.apache.spark.internal.io
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > > >>>>>>>     at org.apache.spark.internal.io
> > > > > >>>>>>
> > .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > > >>>>>>>     at org.apache.spark.internal.io
> > > > > >>>>>>
> > .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > > >
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > >>>>>>>     at
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > > > >>>>>>> Caused by: java.lang.ClassNotFoundException:
> > > > > >>>>>> org.apache.parquet.schema.LogicalTypeAnnotation
> > > > > >>>>>>>     at
> > > java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > > > >>>>>>>     at
> java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > > > >>>>>>>     at
> > > > > >> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > > > >>>>>>>     at
> java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > > > >>>>>>>
> > > > > >>>>>>> michael
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <
> > > gabor@apache.org
> > > > >
> > > > > >>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thanks, Fokko.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Ryan, we did not do such measurements yet. I'm afraid, I
> > won't
> > > > > have
> > > > > >>>>>> enough
> > > > > >>>>>>>> time to do that in the next couple of weeks.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Cheers,
> > > > > >>>>>>>> Gabor
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko
> > > > > >>>>> <fokko@driesprong.frl
> > > > > >>>>>>>
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Thanks Gabor for the explanation. I'd like to change my
> > vote
> > > to
> > > > > +1
> > > > > >>>>>>>>> (non-binding).
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Cheers, Fokko
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Op di 19 nov. 2019 om 18:03 schreef Ryan Blue
> > > > > >>>>>> <rb...@netflix.com.invalid>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> Gabor, what I meant was: have we tried this with real
> data
> > > to
> > > > > see
> > > > > >>>>> the
> > > > > >>>>>>>>>> effect? I think those results would be helpful.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <
> > > > > >>> gabor@apache.org
> > > > > >>>>>>
> > > > > >>>>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>> Hi Ryan,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> It is not easy to calculate. For the column indexes
> > feature
> > > > we
> > > > > >>>>>>>>> introduced
> > > > > >>>>>>>>>>> two new structures saved before the footer: column
> > indexes
> > > > and
> > > > > >>>>> offset
> > > > > >>>>>>>>>>> indexes. If the min/max values are not too long, then
> the
> > > > > >>>>> truncation
> > > > > >>>>>>>>>> might
> > > > > >>>>>>>>>>> not decrease the file size because of the offset
> indexes.
> > > > > >>> Moreover,
> > > > > >>>>>> we
> > > > > >>>>>>>>>> also
> > > > > >>>>>>>>>>> introduced parquet.page.row.count.limit which might
> > > increase
> > > > > the
> > > > > >>>>>> number
> > > > > >>>>>>>>>> of
> > > > > >>>>>>>>>>> pages which leads to increasing the file size.
> > > > > >>>>>>>>>>> The footer itself is also changed and we are saving
> more
> > > > values
> > > > > >> in
> > > > > >>>>>> it:
> > > > > >>>>>>>>>> the
> > > > > >>>>>>>>>>> offset values to the column/offset indexes, the new
> > logical
> > > > > type
> > > > > >>>>>>>>>>> structures, the CRC checksums (we might have some
> > others).
> > > > > >>>>>>>>>>> So, the size of the files with small amount of data
> will
> > be
> > > > > >>>>> increased
> > > > > >>>>>>>>>>> (because of the larger footer). The size of the files
> > where
> > > > the
> > > > > >>>>>> values
> > > > > >>>>>>>>>> can
> > > > > >>>>>>>>>>> be encoded very well (RLE) will probably be increased
> > > > (because
> > > > > >> we
> > > > > >>>>>> will
> > > > > >>>>>>>>>> have
> > > > > >>>>>>>>>>> more pages). The size of some files where the values
> are
> > > long
> > > > > >>>>>> (>64bytes
> > > > > >>>>>>>>>> by
> > > > > >>>>>>>>>>> default) might be decreased because of truncating the
> > > min/max
> > > > > >>>>> values.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>> Gabor
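
The two size effects described above are driven by writer properties. Below is
a minimal sketch of tuning them through the Hadoop configuration; the property
names and defaults are the ones shipped in 1.11.0, but verify them against
your build before relying on them.

    import org.apache.hadoop.conf.Configuration;

    public class WriterTuningSketch {
      public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Cap on rows per page, introduced together with the column indexes
        conf.setInt("parquet.page.row.count.limit", 20000);
        // Length (bytes) at which column index min/max values are truncated
        conf.setInt("parquet.columnindex.truncate.length", 64);
        return conf;
      }
    }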
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue
> > > > > >>>>> <rblue@netflix.com.invalid
> > > > > >>>>>>>
> > > > > >>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Gabor, do we have an idea of the additional overhead
> > for a
> > > > > >>>>> non-test
> > > > > >>>>>>>>>> data
> > > > > >>>>>>>>>>>> file? It should be easy to validate that this doesn't
> > > > > introduce
> > > > > >>> an
> > > > > >>>>>>>>>>>> unreasonable amount of overhead. In some cases, it
> > should
> > > > > >>> actually
> > > > > >>>>>> be
> > > > > >>>>>>>>>>>> smaller since the column indexes are truncated and
> page
> > > > stats
> > > > > >> are
> > > > > >>>>>>>>> not.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > > > > >>>>>>>>>>>> <ga...@cloudera.com.invalid> wrote:
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Hi Fokko,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> For the first point: the referenced constructor is
> > > > > >>>>>>>>>>>>> private and Iceberg uses it via reflection. It is not a
> > > > > >>>>>>>>>>>>> breaking change. I think parquet-mr shall not keep
> > > > > >>>>>>>>>>>>> private methods only because clients might use them via
> > > > > >>>>>>>>>>>>> reflection.
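
For readers following along, here is a minimal sketch of why this kind of
reflective access is fragile. The lookup below is illustrative only and does
not reproduce Iceberg's actual call; the point is that reflection offers no
compile-time protection when a private constructor changes between releases.

    import java.lang.reflect.Constructor;

    public class ReflectiveAccessSketch {
      public static void main(String[] args) throws Exception {
        // The class is not public, so external callers load it by name
        Class<?> clazz =
            Class.forName("org.apache.parquet.hadoop.ColumnChunkPageWriteStore");
        // If the private constructor gains, loses, or reorders parameters in
        // a new release, a signature-based lookup fails only at runtime with
        // NoSuchMethodException
        Constructor<?> ctor = clazz.getDeclaredConstructors()[0];
        ctor.setAccessible(true);
      }
    }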
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> About the checksum: I've agreed on having the CRC
> > > > > >>>>>>>>>>>>> checksum write enabled by default because the benchmarks
> > > > > >>>>>>>>>>>>> did not show significant performance penalties. See
> > > > > >>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/647 for details.
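
For anyone who wants to benchmark the cost themselves, here is a minimal
sketch of turning page checksums off for a single writer. It assumes the
withPageWriteChecksumEnabled builder option added alongside this feature;
check the 1.11.0 API before relying on the exact name.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class ChecksumToggleSketch {
      public static ParquetWriter<GenericRecord> openWriter(Path path, Schema schema)
          throws java.io.IOException {
        return AvroParquetWriter.<GenericRecord>builder(path)
            .withSchema(schema)
            // CRC page checksums are on by default in 1.11.0; disable them to
            // compare write performance and file size against 1.10.x
            .withPageWriteChecksumEnabled(false)
            .build();
      }
    }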
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> About the file size change: 1.11.0 is introducing column
> > > > > >>>>>>>>>>>>> indexes and CRC checksums, removing the statistics from
> > > > > >>>>>>>>>>>>> the page headers, and maybe making other changes that
> > > > > >>>>>>>>>>>>> impact file size. If only file size is in question, I
> > > > > >>>>>>>>>>>>> cannot see a breaking change here.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Regards,
> > > > > >>>>>>>>>>>>> Gabor
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> > > > > >>>>>>>>>> <fokko@driesprong.frl
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found
> > three
> > > > > >> things:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> - We've broken backward compatibility of the constructor
> > > > > >>>>>>>>>>>>>> of ColumnChunkPageWriteStore
> > > > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
> > > > > >>>>>>>>>>>>>> This required a change
> > > > > >>>>>>>>>>>>>> <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
> > > > > >>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will
> > > > > >>>>>>>>>>>>>> be a new RC, I've submitted a patch:
> > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
> > > > > >>>>>>>>>>>>>> - Related, and something we need to put in the changelog:
> > > > > >>>>>>>>>>>>>> checksums are enabled by default, see
> > > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > > >>>>>>>>>>>>>> This will impact performance. I would suggest disabling
> > > > > >>>>>>>>>>>>>> it by default: https://github.com/apache/parquet-mr/pull/700
> > > > > >>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
> > > > > >>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've
> > > > > >>>>>>>>>>>>>> noticed that the split-test was failing:
> > > > > >>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > >>>>>>>>>>>>>> The two records are now divided over four Spark
> > > > > >>>>>>>>>>>>>> partitions. Something in the output has changed, since
> > > > > >>>>>>>>>>>>>> the files are bigger now. Does anyone have an idea of
> > > > > >>>>>>>>>>>>>> what's changed, or a way to check this? The only thing I
> > > > > >>>>>>>>>>>>>> can think of is the checksum mentioned above.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov
> > 21:09
> > > > > >>>>>>>>>>>>>>
> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov
> > 21:05
> > > > > >>>>>>>>>>>>>>
> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> $ parquet-tools cat
> > > > > >>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > >>>>>>>>>>>>>> id = 1
> > > > > >>>>>>>>>>>>>> data = a
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> $ parquet-tools cat
> > > > > >>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > >>>>>>>>>>>>>> id = 1
> > > > > >>>>>>>>>>>>>> data = a
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> A binary diff here:
> > > > > >>>>>>>>>>>>>>
> > > > > >> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Cheers, Fokko
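
Rather than a raw binary diff, the footers of the two files can be compared
directly. Here is a minimal sketch using the public reader API; the file names
are the ones from the listing above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class FooterDiffSketch {
      static void dump(String file) throws java.io.IOException {
        try (ParquetFileReader reader = ParquetFileReader.open(
            HadoopInputFile.fromPath(new Path(file), new Configuration()))) {
          for (BlockMetaData block : reader.getFooter().getBlocks()) {
            // Per-row-group sizes reveal where the extra bytes come from
            System.out.println(block.getRowCount() + " rows, "
                + block.getCompressedSize() + " compressed bytes");
          }
        }
      }

      public static void main(String[] args) throws Exception {
        dump("parquet-1-10-1.parquet");
        dump("parquet-1-11-0.parquet");
      }
    }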
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
> > > > > >>>>>>>>>>>>> chenjunjiedada@gmail.com
> > > > > >>>>>>>>>>>>>>> :
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> +1
> > > > > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install
> > > > > >> successfully.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid>
> > 于2019年11月14日周四
> > > > > >>>>>>>>> 下午2:05写道:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> +1
> > > > > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module:
> > build/sbt
> > > > > >>>>>>>>>>>>> "sql/test-only"
> > > > > >>>>>>>>>>>>>>> -Phadoop-3.2
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <
> > > > > >> gabor@apache.org>
> > > > > >>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Hi everyone,
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> I propose the following RC to be released as
> > official
> > > > > >>>>>>>>>> Apache
> > > > > >>>>>>>>>>>>>> Parquet
> > > > > >>>>>>>>>>>>>>> 1.11.0
> > > > > >>>>>>>>>>>>>>>> release.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> The commit id is
> > > > 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > >>>>>>>>>>>>>>>> * This corresponds to the tag:
> > > apache-parquet-1.11.0-rc7
> > > > > >>>>>>>>>>>>>>>> *
> > > > > >>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> The release tarball, signature, and checksums are
> > > here:
> > > > > >>>>>>>>>>>>>>>> *
> > > > > >>>>>>>>>>>>>>>> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> You can find the KEYS file here:
> > > > > >>>>>>>>>>>>>>>> *
> > > > > >>>>>>>>>>>>>>>> https://apache.org/dist/parquet/KEYS
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
> > > > > >>>>>>>>>>>>>>>> *
> > > > > >>>>>>>>>>>>>>>> https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> This release includes the changes listed at:
> > > > > >>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Please download, verify, and test.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Please vote in the next 72 hours.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > >>>>>>>>>>>>>>>> [ ] +0
> > > > > >>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> --
> > > > > >>>>>>>>>>>> Ryan Blue
> > > > > >>>>>>>>>>>> Software Engineer
> > > > > >>>>>>>>>>>> Netflix
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> --
> > > > > >>>>>>>>>> Ryan Blue
> > > > > >>>>>>>>>> Software Engineer
> > > > > >>>>>>>>>> Netflix
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> Ryan Blue
> > > > > >>>> Software Engineer
> > > > > >>>> Netflix
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Ryan Blue
> > > > > > Software Engineer
> > > > > > Netflix
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Gabor, good point about not being able to check new APIs. Updating the
comparison baseline to the previous minor release would also allow us to get
rid of temporary exclusions, like the one you pointed out for schema. It
would be great to improve what we catch there.
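
Until the baseline moves, one cheap stopgap is a unit test that asserts the
presence of APIs added after 1.7.0. A minimal sketch, using the builder method
from the stack traces earlier in this thread:

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.Types;

    public class ApiPresenceSketch {
      public static void main(String[] args) throws NoSuchMethodException {
        // Fails fast if the logical-type overload added in 1.11.0 disappears;
        // a semver check pinned to a 1.7.0 baseline would miss that removal
        Types.PrimitiveBuilder.class.getMethod("as", LogicalTypeAnnotation.class);
        System.out.println("PrimitiveBuilder.as(LogicalTypeAnnotation) present");
      }
    }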

On Mon, Nov 25, 2019 at 1:56 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Hi Ryan,
>
> It is a different topic, but I would like to reflect on it briefly.
> I understand that 1.7.0 was the first Apache release. The problem with
> doing the compatibility checks against 1.7.0 is that we can easily add
> incompatibilities in APIs that were added after 1.7.0. For example: adding a
> new class for public use in 1.8.0 and then removing it in 1.9.0. The
> compatibility check would not discover this breaking change. So, I think, a
> better approach would be to always compare to the previous minor release
> (e.g. comparing 1.9.0 to 1.8.0 etc.).
> As I've written before, even org/apache/parquet/schema/** is excluded from
> the compatibility check. As far as I know this is public API. So, I am not
> sure that only packages that are not part of the public API are excluded.
>
> Let's discuss this on the next parquet sync.
>
> Regards,
> Gabor
>
> On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > Gabor,
> >
> > 1.7.0 was the first version using the org.apache.parquet packages, so
> > that's the correct base version for compatibility checks. The exclusions
> in
> > the POM are classes that the Parquet community does not consider public.
> We
> > rely on these checks to highlight binary incompatibilities, and then we
> > discuss them on this list or in the dev sync. If the class is internal,
> we
> > add an exclusion for it.
> >
> > I know you're familiar with this process since we've talked about it
> > before. I also know that you'd rather have more strict binary
> > compatibility, but until we have someone with the time to do some
> > maintenance and build a public API module, I'm afraid that's what we have
> > to work with.
> >
> > Michael,
> >
> > I hope the context above is helpful and explains why running a binary
> > compatibility check tool will find incompatible changes. We allow binary
> > incompatible changes to internal classes and modules, like
> parquet-common.
> >
> > On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <ga...@apache.org>
> > wrote:
> >
> > > Ryan,
> > > I would not trust our compatibility checks (semver) too much. Currently,
> > > it is configured to compare our current version to 1.7.0. It means that
> > > anything added since 1.7.0 and then broken in a later release won't be
> > > caught. In addition, many packages are excluded from the check for
> > > different reasons. For example, org/apache/parquet/schema/** is excluded,
> > > so if this really were an API compatibility issue we certainly wouldn't
> > > catch it.
> > >
> > > Michael,
> > > It fails because of a NoSuchMethodError pointing to a method that is
> > > newly introduced in 1.11. Both the caller and the callee are shipped by
> > > parquet-mr. So, I'm quite sure it is a classpath issue. It seems that the
> > > 1.11 version of the parquet-column jar is not on the classpath.
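
A quick way to confirm this diagnosis is to print which jar each Parquet class
is actually loaded from. A minimal sketch that can be run inside the affected
application or a Spark task:

    public class ClasspathProbeSketch {
      public static void main(String[] args) throws ClassNotFoundException {
        // Both classes should resolve to 1.11.0 jars; a mixed result confirms
        // that an older parquet-column is shadowing the new one
        for (String name : new String[] {
            "org.apache.parquet.avro.AvroSchemaConverter",
            "org.apache.parquet.schema.Types"}) {
          Class<?> clazz = Class.forName(name);
          System.out.println(name + " -> "
              + clazz.getProtectionDomain().getCodeSource().getLocation());
        }
      }
    }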
> > >
> > >
> > > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <he...@gmail.com>
> wrote:
> > >
> > > > The dependency versions are consistent in our artifact
> > > >
> > > > $ mvn dependency:tree | grep parquet
> > > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > > [INFO] |     \-
> > > > org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > > [INFO] |  |  \-
> org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > > >
> > > > The latter error
> > > >
> > > > Caused by: org.apache.spark.SparkException: Job aborted due to stage
> > > > failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
> Lost
> > > task
> > > > 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> > > > java.lang.NoSuchMethodError:
> > > >
> > >
> >
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > >         at
> > > >
> > >
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > >
> > > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > > >
> > > > $ spark-submit --version
> > > > Welcome to
> > > >       ____              __
> > > >      / __/__  ___ _____/ /__
> > > >     _\ \/ _ \/ _ `/ __/  '_/
> > > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > > >       /_/
> > > >
> > > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM,
> > 1.8.0_191
> > > > Branch
> > > > Compiled by user  on 2019-08-27T21:21:38Z
> > > > Revision
> > > > Url
> > > > Type --help for more information.
> > > >
> > > >
> > > >
> > > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rb...@netflix.com.INVALID>
> > > > wrote:
> > > > >
> > > > > Thanks for looking into it, Nandor. That doesn't sound like a
> problem
> > > > with
> > > > > Parquet, but a problem with the test environment since parquet-avro
> > > > depends
> > > > > on a newer API method.
> > > > >
> > > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
> > > > <nk...@cloudera.com.invalid>
> > > > > wrote:
> > > > >
> > > > >> I'm not sure that this is a binary compatibility issue. The missing
> > > > >> builder method was recently added in 1.11.0 with the introduction of
> > > > >> the new logical type API, while the original version of this method
> > > > >> (the one with a single OriginalType input parameter, called before by
> > > > >> AvroSchemaConverter) is kept untouched. It seems to me that the
> > > > >> Parquet versions on the Spark executor are mismatched: parquet-avro
> > > > >> is on 1.11.0, but parquet-column is still on an older version.
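
For context, the missing method is part of the logical type annotation API
that supersedes OriginalType in 1.11.0. A minimal sketch of the new builder
overload that AvroSchemaConverter now calls; it requires parquet-column 1.11.0
on the classpath:

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class LogicalTypeSketch {
      public static void main(String[] args) {
        MessageType schema = Types.buildMessage()
            .required(PrimitiveTypeName.INT64).named("id")
            .required(PrimitiveTypeName.BINARY)
                // the 1.11.0 overload from the NoSuchMethodError above;
                // pre-1.11 code used .as(OriginalType.UTF8) instead
                .as(LogicalTypeAnnotation.stringType()).named("data")
            .named("record");
        System.out.println(schema);
      }
    }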
> > > > >>
> > > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <heuermh@gmail.com
> >
> > > > wrote:
> > > > >>
> > > > >>> Perhaps not strictly necessary to say, but if this particular
> > > > >>> compatibility break between 1.10 and 1.11 was intentional, and no
> > > other
> > > > >>> compatibility breaks are found, I would vote -1 (non-binding) on
> > this
> > > > RC
> > > > >>> such that we might go back and revisit the changes to preserve
> > > > >>> compatibility.
> > > > >>>
> > > > >>> I am not sure there is presently enough motivation in the Spark
> > > project
> > > > >>> for a release after 2.4.4 and before 3.0 in which to bump the
> > Parquet
> > > > >>> dependency version to 1.11.x.
> > > > >>>
> > > > >>>   michael
> > > > >>>
> > > > >>>
> > > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue
> <rblue@netflix.com.INVALID
> > >
> > > > >>> wrote:
> > > > >>>>
> > > > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs?
> > From
> > > > the
> > > > >>>> stack trace, it looks like this 1.11.0 RC breaks binary
> > > compatibility
> > > > >> in
> > > > >>>> the type builders.
> > > > >>>>
> > > > >>>> Looks like this should have been caught by the binary
> > compatibility
> > > > >>> checks.
> > > > >>>>
> > > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <
> > gabor@apache.org>
> > > > >>> wrote:
> > > > >>>>
> > > > >>>>> Hi Michael,
> > > > >>>>>
> > > > >>>>> Unfortunately, I don't have too much experience with Spark. But if
> > > > >>>>> Spark uses the parquet-mr library in an embedded way (that's how
> > > > >>>>> Hive uses it), it is required to re-build Spark with the 1.11 RC
> > > > >>>>> parquet-mr.
> > > > >>>>>
> > > > >>>>> Regards,
> > > > >>>>> Gabor
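
If rebuilding Spark is not practical, a commonly used workaround is to bundle
the 1.11 jars with the application and ask Spark to prefer them over its own.
A minimal sketch, with the caveat that changing class-loading order can break
other dependencies:

    import org.apache.spark.SparkConf;

    public class UserClasspathFirstSketch {
      public static SparkConf conf() {
        return new SparkConf()
            // Prefer jars bundled with the application (e.g. parquet-* 1.11.0)
            // over the versions shipped inside the Spark distribution
            .set("spark.driver.userClassPathFirst", "true")
            .set("spark.executor.userClassPathFirst", "true");
      }
    }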
> > > > >>>>>
> > > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <
> heuermh@gmail.com
> > >
> > > > >>> wrote:
> > > > >>>>>
> > > > >>>>>> It appears a provided-scope dependency on spark-sql that leaks
> > > > >>>>>> old Parquet versions was causing the runtime error below. After
> > > > >>>>>> including new parquet-column and parquet-hadoop compile-scope
> > > > >>>>>> dependencies (in addition to parquet-avro, which we already have
> > > > >>>>>> at compile scope), our build succeeds.
> > > > >>>>>>
> > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232 <
> > > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232>
> > > > >>>>>>
> > > > >>>>>> However, when running via spark-submit I run into a similar
> > > runtime
> > > > >>> error
> > > > >>>>>>
> > > > >>>>>> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > > >>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > >>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > >>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > >>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > >>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Will bumping our library dependency version to 1.11 require a
> > new
> > > > >>> version
> > > > >>>>>> of Spark, built against Parquet 1.11?
> > > > >>>>>>
> > > > >>>>>> Please accept my apologies if this is heading out-of-scope for
> > the
> > > > >>>>> Parquet
> > > > >>>>>> mailing list.
> > > > >>>>>>
> > > > >>>>>>  michael
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <
> heuermh@GMAIL.COM
> > >
> > > > >>> wrote:
> > > > >>>>>>>
> > > > >>>>>>> I am willing to do some benchmarking on genomic data at scale
> > but
> > > > am
> > > > >>>>> not
> > > > >>>>>> quite sure what the Spark target version for 1.11.0 might be.
> > > Will
> > > > >>>>> Parquet
> > > > >>>>>> 1.11.0 be compatible in Spark 2.4.x?
> > > > >>>>>>>
> > > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > > > >>>>>>>
> > > > >>>>>>> …
> > > > >>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
> > > > >>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > > >>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > > >>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > > >>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > > >>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > > >>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> > > > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > > >>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > > >>>>>>>
> > > > >>>>>>> michael
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <
> > gabor@apache.org
> > > >
> > > > >>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks, Fokko.
> > > > >>>>>>>>
> > > > >>>>>>>> Ryan, we did not do such measurements yet. I'm afraid, I
> won't
> > > > have
> > > > >>>>>> enough
> > > > >>>>>>>> time to do that in the next couple of weeks.
> > > > >>>>>>>>
> > > > >>>>>>>> Cheers,
> > > > >>>>>>>> Gabor
> > > > >>>>>>>>
> > > > >>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko
> > > > >>>>> <fokko@driesprong.frl
> > > > >>>>>>>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Thanks Gabor for the explanation. I'd like to change my
> vote
> > to
> > > > +1
> > > > >>>>>>>>> (non-binding).
> > > > >>>>>>>>>
> > > > >>>>>>>>> Cheers, Fokko
> > > > >>>>>>>>>
> > > > >>>>>>>>> Op di 19 nov. 2019 om 18:03 schreef Ryan Blue
> > > > >>>>>> <rb...@netflix.com.invalid>
> > > > >>>>>>>>>
> > > > >>>>>>>>>> Gabor, what I meant was: have we tried this with real data
> > to
> > > > see
> > > > >>>>> the
> > > > >>>>>>>>>> effect? I think those results would be helpful.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <
> > > > >>> gabor@apache.org
> > > > >>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Hi Ryan,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> It is not easy to calculate. For the column indexes
> feature
> > > we
> > > > >>>>>>>>> introduced
> > > > >>>>>>>>>>> two new structures saved before the footer: column
> indexes
> > > and
> > > > >>>>> offset
> > > > >>>>>>>>>>> indexes. If the min/max values are not too long, then the
> > > > >>>>> truncation
> > > > >>>>>>>>>> might
> > > > >>>>>>>>>>> not decrease the file size because of the offset indexes.
> > > > >>> Moreover,
> > > > >>>>>> we
> > > > >>>>>>>>>> also
> > > > >>>>>>>>>>> introduced parquet.page.row.count.limit which might
> > increase
> > > > the
> > > > >>>>>> number
> > > > >>>>>>>>>> of
> > > > >>>>>>>>>>> pages which leads to increasing the file size.
> > > > >>>>>>>>>>> The footer itself is also changed and we are saving more
> > > values
> > > > >> in
> > > > >>>>>> it:
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>> offset values to the column/offset indexes, the new
> logical
> > > > type
> > > > >>>>>>>>>>> structures, the CRC checksums (we might have some
> others).
> > > > >>>>>>>>>>> So, the size of the files with small amount of data will
> be
> > > > >>>>> increased
> > > > >>>>>>>>>>> (because of the larger footer). The size of the files
> where
> > > the
> > > > >>>>>> values
> > > > >>>>>>>>>> can
> > > > >>>>>>>>>>> be encoded very well (RLE) will probably be increased
> > > (because
> > > > >> we
> > > > >>>>>> will
> > > > >>>>>>>>>> have
> > > > >>>>>>>>>>> more pages). The size of some files where the values are
> > long
> > > > >>>>>> (>64bytes
> > > > >>>>>>>>>> by
> > > > >>>>>>>>>>> default) might be decreased because of truncating the
> > min/max
> > > > >>>>> values.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Regards,
> > > > >>>>>>>>>>> Gabor
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue
> > > > >>>>> <rblue@netflix.com.invalid
> > > > >>>>>>>
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Gabor, do we have an idea of the additional overhead
> for a
> > > > >>>>> non-test
> > > > >>>>>>>>>> data
> > > > >>>>>>>>>>>> file? It should be easy to validate that this doesn't
> > > > introduce
> > > > >>> an
> > > > >>>>>>>>>>>> unreasonable amount of overhead. In some cases, it
> should
> > > > >>> actually
> > > > >>>>>> be
> > > > >>>>>>>>>>>> smaller since the column indexes are truncated and page
> > > stats
> > > > >> are
> > > > >>>>>>>>> not.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > > > >>>>>>>>>>>> <ga...@cloudera.com.invalid> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Hi Fokko,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> For the first point. The referenced constructor is
> > private
> > > > and
> > > > >>>>>>>>>> Iceberg
> > > > >>>>>>>>>>>> uses
> > > > >>>>>>>>>>>>> it via reflection. It is not a breaking change. I
> think,
> > > > >>>>> parquet-mr
> > > > >>>>>>>>>>> shall
> > > > >>>>>>>>>>>>> not keep private methods only because of clients might
> > use
> > > > >> them
> > > > >>>>> via
> > > > >>>>>>>>>>>>> reflection.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> About the checksum. I've agreed on having the CRC
> > checksum
> > > > >> write
> > > > >>>>>>>>>>> enabled
> > > > >>>>>>>>>>>> by
> > > > >>>>>>>>>>>>> default because the benchmarks did not show significant
> > > > >>>>> performance
> > > > >>>>>>>>>>>>> penalties. See
> > > https://github.com/apache/parquet-mr/pull/647
> > > > >>> for
> > > > >>>>>>>>>>>> details.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> About the file size change. 1.11.0 is introducing
> column
> > > > >>> indexes,
> > > > >>>>>>>>> CRC
> > > > >>>>>>>>>>>>> checksum, removing the statistics from the page headers
> > and
> > > > >>> maybe
> > > > >>>>>>>>>> other
> > > > >>>>>>>>>>>>> changes that impact file size. If only file size is in
> > > > >> question
> > > > >>> I
> > > > >>>>>>>>>>> cannot
> > > > >>>>>>>>>>>>> see a breaking change here.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Regards,
> > > > >>>>>>>>>>>>> Gabor
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> > > > >>>>>>>>>> <fokko@driesprong.frl
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found
> three
> > > > >> things:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> - We've broken backward compatibility of the
> constructor
> > > of
> > > > >>>>>>>>>>>>>> ColumnChunkPageWriteStore
> > > > >>>>>>>>>>>>>> <
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > > > >>>>>>>>>>>>>>> .
> > > > >>>>>>>>>>>>>> This required a change
> > > > >>>>>>>>>>>>>> <
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there
> > will
> > > be
> > > > >> a
> > > > >>>>>>>>>> new
> > > > >>>>>>>>>>>> RC,
> > > > >>>>>>>>>>>>>> I've
> > > > >>>>>>>>>>>>>> submitted a patch:
> > > > >>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
> > > > >>>>>>>>>>>>>> - Related, that we need to put in the changelog, is
> that
> > > > >>>>>>>>>> checksums
> > > > >>>>>>>>>>>> are
> > > > >>>>>>>>>>>>>> enabled by default:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > >>>>>>>>>>>>>> This
> > > > >>>>>>>>>>>>>> will impact performance. I would suggest disabling it
> by
> > > > >>>>>>>>>> default:
> > > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
> > > > >>>>>>>>>>>>>> <
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've
> > > noticed
> > > > >>>>>>>>>> that
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>> split-test was failing:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > >>>>>>>>>>>>>> The
> > > > >>>>>>>>>>>>>> two records are now divided over four Spark
> partitions.
> > > > >>>>>>>>>> Something
> > > > >>>>>>>>>>> in
> > > > >>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>> output has changed since the files are bigger now. Has
> > > > anyone
> > > > >>>>>>>>>> any
> > > > >>>>>>>>>>>> idea
> > > > >>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> check what's changed, or a way to check this? The only
> > > thing
> > > > >> I
> > > > >>>>>>>>>> can
> > > > >>>>>>>>>>>>>> think of
> > > > >>>>>>>>>>>>>> is the checksum mentioned above.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov
> 21:09
> > > > >>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov
> 21:05
> > > > >>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> $ parquet-tools cat
> > > > >>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > >>>>>>>>>>>>>> id = 1
> > > > >>>>>>>>>>>>>> data = a
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> $ parquet-tools cat
> > > > >>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > >>>>>>>>>>>>>> id = 1
> > > > >>>>>>>>>>>>>> data = a
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> A binary diff here:
> > > > >>>>>>>>>>>>>>
> > > > >> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Cheers, Fokko
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
> > > > >>>>>>>>>>>>> chenjunjiedada@gmail.com
> > > > >>>>>>>>>>>>>>> :
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> +1
> > > > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install
> > > > >> successfully.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid>
> 于2019年11月14日周四
> > > > >>>>>>>>> 下午2:05写道:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> +1
> > > > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module:
> build/sbt
> > > > >>>>>>>>>>>>> "sql/test-only"
> > > > >>>>>>>>>>>>>>> -Phadoop-3.2
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <
> > > > >> gabor@apache.org>
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Hi everyone,
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> I propose the following RC to be released as
> official
> > > > >>>>>>>>>> Apache
> > > > >>>>>>>>>>>>>> Parquet
> > > > >>>>>>>>>>>>>>> 1.11.0
> > > > >>>>>>>>>>>>>>>> release.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > >>>>>>>>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > >>>>>>>>>>>>>>>> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> The release tarball, signature, and checksums are here:
> > > > >>>>>>>>>>>>>>>> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> You can find the KEYS file here:
> > > > >>>>>>>>>>>>>>>> * https://apache.org/dist/parquet/KEYS
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
> > > > >>>>>>>>>>>>>>>> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> This release includes the changes listed at:
> > > > >>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Please download, verify, and test.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Please vote in the next 72 hours.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
> > > > >>>>>>>>>>>>>>>> [ ] +0
> > > > >>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> --
> > > > >>>>>>>>>>>> Ryan Blue
> > > > >>>>>>>>>>>> Software Engineer
> > > > >>>>>>>>>>>> Netflix
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> --
> > > > >>>>>>>>>> Ryan Blue
> > > > >>>>>>>>>> Software Engineer
> > > > >>>>>>>>>> Netflix
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> --
> > > > >>>> Ryan Blue
> > > > >>>> Software Engineer
> > > > >>>> Netflix
> > > > >>>
> > > > >>>
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > >
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Ryan,

It is a different topic, but I would like to reflect on it briefly.
I understand that 1.7.0 was the first Apache release. The problem with
running the compatibility checks against 1.7.0 is that we can easily break
API that was added after 1.7.0. For example: adding a new class for public
use in 1.8.0 and then removing it in 1.9.0. The compatibility check would
not discover this breaking change. So I think a better approach would be to
always compare against the previous minor release (e.g. comparing 1.9.0 to
1.8.0, and so on).
As I've written before, even org/apache/parquet/schema/** is excluded from
the compatibility check. As far as I know, this is public API. So I am not
sure that only packages outside the public API are excluded.
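
For illustration, here is a minimal sketch of client code that builds a
schema directly against that package (the class name is made up; the calls
are the ones visible in the stack traces earlier in this thread):

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class SchemaApiExample {
  public static void main(String[] args) {
    // Builds a one-column message type via the public schema builders,
    // using the LogicalTypeAnnotation overload of as() added in 1.11.0.
    MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.BINARY)
        .as(LogicalTypeAnnotation.stringType())
        .named("data")
        .named("record");
    System.out.println(schema);
  }
}

Any downstream project can write code like this, which is why I consider
the package public API in practice, whatever the exclusion list says.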

Let's discuss this on the next parquet sync.

Regards,
Gabor

On Fri, Nov 22, 2019 at 6:20 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Gabor,
>
> 1.7.0 was the first version using the org.apache.parquet packages, so
> that's the correct base version for compatibility checks. The exclusions in
> the POM are classes that the Parquet community does not consider public. We
> rely on these checks to highlight binary incompatibilities, and then we
> discuss them on this list or in the dev sync. If the class is internal, we
> add an exclusion for it.
>
> I know you're familiar with this process since we've talked about it
> before. I also know that you'd rather have more strict binary
> compatibility, but until we have someone with the time to do some
> maintenance and build a public API module, I'm afraid that's what we have
> to work with.
>
> Michael,
>
> I hope the context above is helpful and explains why running a binary
> compatibility check tool will find incompatible changes. We allow binary
> incompatible changes to internal classes and modules, like parquet-common.
>
> On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <ga...@apache.org>
> wrote:
>
> > Ryan,
> > I would not trust our compatibility checks (semver) too much. Currently,
> it
> > is configured to compare our current version to 1.7.0. It means anything
> > that is added since 1.7.0 and then broke in a later release won't be
> > caught. In addition, many packages are excluded from the check because of
> > different reasons. For example org/apache/parquet/schema/** is excluded
> so
> > if it would really be an API compatibility issue we certainly wouldn't
> > catch it.
> >
> > Michael,
> > It fails because of a NoSuchMethodError pointing to a method that is
> newly
> > introduced in 1.11. Both the caller and the callee shipped by parquet-mr.
> > So, I'm quite sure it is a classpath issue. It seems that the 1.11
> version
> > of the parquet-column jar is not on the classpath.
> >
> >
> > On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <he...@gmail.com> wrote:
> >
> > > The dependency versions are consistent in our artifact
> > >
> > > $ mvn dependency:tree | grep parquet
> > > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > > [INFO] |     \-
> > > org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> > >
> > > The latter error
> > >
> > > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > >
> > > occurs when I attempt to run via spark-submit on Spark 2.4.4
> > >
> > > $ spark-submit --version
> > > Welcome to
> > >       ____              __
> > >      / __/__  ___ _____/ /__
> > >     _\ \/ _ \/ _ `/ __/  '_/
> > >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> > >       /_/
> > >
> > > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM,
> 1.8.0_191
> > > Branch
> > > Compiled by user  on 2019-08-27T21:21:38Z
> > > Revision
> > > Url
> > > Type --help for more information.
> > >
> > >
> > >
> > > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rb...@netflix.com.INVALID>
> > > wrote:
> > > >
> > > > Thanks for looking into it, Nandor. That doesn't sound like a problem
> > > with
> > > > Parquet, but a problem with the test environment since parquet-avro
> > > depends
> > > > on a newer API method.
> > > >
> > > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
> > > <nk...@cloudera.com.invalid>
> > > > wrote:
> > > >
> > > >> I'm not sure that this is a binary compatibility issue. The missing
> > > builder
> > > >> method was recently added in 1.11.0 with the introduction of the new
> > > >> logical type API, while the original version (one with a single
> > > >> OriginalType input parameter called before by AvroSchemaConverter)
> of
> > > this
> > > >> method is kept untouched. It seems to me that the Parquet version on
> > > Spark
> > > >> executor mismatch: parquet-avro is on 1.11.0, but parquet-column is
> > > still
> > > >> on an older version.
> > > >>
> > > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <he...@gmail.com>
> > > wrote:
> > > >>
> > > >>> Perhaps not strictly necessary to say, but if this particular
> > > >>> compatibility break between 1.10 and 1.11 was intentional, and no
> > other
> > > >>> compatibility breaks are found, I would vote -1 (non-binding) on
> this
> > > RC
> > > >>> such that we might go back and revisit the changes to preserve
> > > >>> compatibility.
> > > >>>
> > > >>> I am not sure there is presently enough motivation in the Spark
> > project
> > > >>> for a release after 2.4.4 and before 3.0 in which to bump the
> Parquet
> > > >>> dependency version to 1.11.x.
> > > >>>
> > > >>>   michael
> > > >>>
> > > >>>
> > > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rblue@netflix.com.INVALID
> >
> > > >>> wrote:
> > > >>>>
> > > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs?
> From
> > > the
> > > >>>> stack trace, it looks like this 1.11.0 RC breaks binary
> > compatibility
> > > >> in
> > > >>>> the type builders.
> > > >>>>
> > > >>>> Looks like this should have been caught by the binary
> compatibility
> > > >>> checks.
> > > >>>>
> > > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <
> gabor@apache.org>
> > > >>> wrote:
> > > >>>>
> > > >>>>> Hi Michael,
> > > >>>>>
> > > >>>>> Unfortunately, I don't have too much experience on Spark. But if
> > > spark
> > > >>> uses
> > > >>>>> the parquet-mr library in an embedded way (that's how Hive uses
> it)
> > > it
> > > >>> is
> > > >>>>> required to re-build Spark with 1.11 RC parquet-mr.
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>> Gabor
> > > >>>>>
> > > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <heuermh@gmail.com
> >
> > > >>> wrote:
> > > >>>>>
> > > >>>>>> It appears a provided scope dependency on spark-sql leaks old
> > > parquet
> > > >>>>>> versions was causing the runtime error below.  After including
> new
> > > >>>>>> parquet-column and parquet-hadoop compile scope dependencies (in
> > > >>> addition
> > > >>>>>> to parquet-avro, which we already have at compile scope), our
> > build
> > > >>>>>> succeeds.
> > > >>>>>>
> > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232 <
> > > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232>
> > > >>>>>>
> > > >>>>>> However, when running via spark-submit I run into a similar
> > runtime
> > > >>> error
> > > >>>>>>
> > > >>>>>> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > > >>>>>>       at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > > >>>>>>       at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > >>>>>>       at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > >>>>>>       at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > >>>>>>       at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > >>>>>>       at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > >>>>>>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > >>>>>>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > >>>>>>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > >>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Will bumping our library dependency version to 1.11 require a
> new
> > > >>> version
> > > >>>>>> of Spark, built against Parquet 1.11?
> > > >>>>>>
> > > >>>>>> Please accept my apologies if this is heading out-of-scope for
> the
> > > >>>>> Parquet
> > > >>>>>> mailing list.
> > > >>>>>>
> > > >>>>>>  michael
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <heuermh@GMAIL.COM
> >
> > > >>> wrote:
> > > >>>>>>>
> > > >>>>>>> I am willing to do some benchmarking on genomic data at scale
> but
> > > am
> > > >>>>> not
> > > >>>>>> quite sure what the Spark target version for 1.11.0 might be.
> > Will
> > > >>>>> Parquet
> > > >>>>>> 1.11.0 be compatible in Spark 2.4.x?
> > > >>>>>>>
> > > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > > >>>>>>>
> > > >>>>>>> …
> > > >>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
> > > >>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > > >>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > > >>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > > >>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > > >>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > > >>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > > >>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > > >>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > >>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > > >>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> > > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > > >>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > > >>>>>>>
> > > >>>>>>> michael
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <
> gabor@apache.org
> > >
> > > >>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>> Thanks, Fokko.
> > > >>>>>>>>
> > > >>>>>>>> Ryan, we did not do such measurements yet. I'm afraid, I won't
> > > have
> > > >>>>>> enough
> > > >>>>>>>> time to do that in the next couple of weeks.
> > > >>>>>>>>
> > > >>>>>>>> Cheers,
> > > >>>>>>>> Gabor
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko
> > > >>>>> <fokko@driesprong.frl
> > > >>>>>>>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Thanks Gabor for the explanation. I'd like to change my vote
> to
> > > +1
> > > >>>>>>>>> (non-binding).
> > > >>>>>>>>>
> > > >>>>>>>>> Cheers, Fokko
> > > >>>>>>>>>
> > > >>>>>>>>> Op di 19 nov. 2019 om 18:03 schreef Ryan Blue
> > > >>>>>> <rb...@netflix.com.invalid>
> > > >>>>>>>>>
> > > >>>>>>>>>> Gabor, what I meant was: have we tried this with real data
> to
> > > see
> > > >>>>> the
> > > >>>>>>>>>> effect? I think those results would be helpful.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <
> > > >>> gabor@apache.org
> > > >>>>>>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi Ryan,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> It is not easy to calculate. For the column indexes feature
> > we
> > > >>>>>>>>> introduced
> > > >>>>>>>>>>> two new structures saved before the footer: column indexes
> > and
> > > >>>>> offset
> > > >>>>>>>>>>> indexes. If the min/max values are not too long, then the
> > > >>>>> truncation
> > > >>>>>>>>>> might
> > > >>>>>>>>>>> not decrease the file size because of the offset indexes.
> > > >>> Moreover,
> > > >>>>>> we
> > > >>>>>>>>>> also
> > > >>>>>>>>>>> introduced parquet.page.row.count.limit which might
> increase
> > > the
> > > >>>>>> number
> > > >>>>>>>>>> of
> > > >>>>>>>>>>> pages which leads to increasing the file size.
> > > >>>>>>>>>>> The footer itself is also changed and we are saving more
> > values
> > > >> in
> > > >>>>>> it:
> > > >>>>>>>>>> the
> > > >>>>>>>>>>> offset values to the column/offset indexes, the new logical
> > > type
> > > >>>>>>>>>>> structures, the CRC checksums (we might have some others).
> > > >>>>>>>>>>> So, the size of the files with small amount of data will be
> > > >>>>> increased
> > > >>>>>>>>>>> (because of the larger footer). The size of the files where
> > the
> > > >>>>>> values
> > > >>>>>>>>>> can
> > > >>>>>>>>>>> be encoded very well (RLE) will probably be increased
> > (because
> > > >> we
> > > >>>>>> will
> > > >>>>>>>>>> have
> > > >>>>>>>>>>> more pages). The size of some files where the values are
> long
> > > >>>>>> (>64bytes
> > > >>>>>>>>>> by
> > > >>>>>>>>>>> default) might be decreased because of truncating the
> min/max
> > > >>>>> values.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Regards,
> > > >>>>>>>>>>> Gabor
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue
> > > >>>>> <rblue@netflix.com.invalid
> > > >>>>>>>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Gabor, do we have an idea of the additional overhead for a
> > > >>>>> non-test
> > > >>>>>>>>>> data
> > > >>>>>>>>>>>> file? It should be easy to validate that this doesn't
> > > introduce
> > > >>> an
> > > >>>>>>>>>>>> unreasonable amount of overhead. In some cases, it should
> > > >>> actually
> > > >>>>>> be
> > > >>>>>>>>>>>> smaller since the column indexes are truncated and page
> > stats
> > > >> are
> > > >>>>>>>>> not.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > > >>>>>>>>>>>> <ga...@cloudera.com.invalid> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi Fokko,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> For the first point. The referenced constructor is
> private
> > > and
> > > >>>>>>>>>> Iceberg
> > > >>>>>>>>>>>> uses
> > > >>>>>>>>>>>>> it via reflection. It is not a breaking change. I think,
> > > >>>>> parquet-mr
> > > >>>>>>>>>>> shall
> > > >>>>>>>>>>>>> not keep private methods only because of clients might
> use
> > > >> them
> > > >>>>> via
> > > >>>>>>>>>>>>> reflection.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> About the checksum. I've agreed on having the CRC
> checksum
> > > >> write
> > > >>>>>>>>>>> enabled
> > > >>>>>>>>>>>> by
> > > >>>>>>>>>>>>> default because the benchmarks did not show significant
> > > >>>>> performance
> > > >>>>>>>>>>>>> penalties. See
> > https://github.com/apache/parquet-mr/pull/647
> > > >>> for
> > > >>>>>>>>>>>> details.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> About the file size change. 1.11.0 is introducing column
> > > >>> indexes,
> > > >>>>>>>>> CRC
> > > >>>>>>>>>>>>> checksum, removing the statistics from the page headers
> and
> > > >>> maybe
> > > >>>>>>>>>> other
> > > >>>>>>>>>>>>> changes that impact file size. If only file size is in
> > > >> question
> > > >>> I
> > > >>>>>>>>>>> cannot
> > > >>>>>>>>>>>>> see a breaking change here.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>> Gabor
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> > > >>>>>>>>>> <fokko@driesprong.frl
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three
> > > >> things:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - We've broken backward compatibility of the constructor
> > of
> > > >>>>>>>>>>>>>> ColumnChunkPageWriteStore
> > > >>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > > >>>>>>>>>>>>>>> .
> > > >>>>>>>>>>>>>> This required a change
> > > >>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there
> will
> > be
> > > >> a
> > > >>>>>>>>>> new
> > > >>>>>>>>>>>> RC,
> > > >>>>>>>>>>>>>> I've
> > > >>>>>>>>>>>>>> submitted a patch:
> > > >>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
> > > >>>>>>>>>>>>>> - Related, that we need to put in the changelog, is that
> > > >>>>>>>>>> checksums
> > > >>>>>>>>>>>> are
> > > >>>>>>>>>>>>>> enabled by default:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > >>>>>>>>>>>>>> This
> > > >>>>>>>>>>>>>> will impact performance. I would suggest disabling it by
> > > >>>>>>>>>> default:
> > > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
> > > >>>>>>>>>>>>>> <
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've
> > noticed
> > > >>>>>>>>>> that
> > > >>>>>>>>>>>> the
> > > >>>>>>>>>>>>>> split-test was failing:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > >>>>>>>>>>>>>> The
> > > >>>>>>>>>>>>>> two records are now divided over four Spark partitions.
> > > >>>>>>>>>> Something
> > > >>>>>>>>>>> in
> > > >>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>> output has changed since the files are bigger now. Has
> > > anyone
> > > >>>>>>>>>> any
> > > >>>>>>>>>>>> idea
> > > >>>>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> check what's changed, or a way to check this? The only
> > thing
> > > >> I
> > > >>>>>>>>>> can
> > > >>>>>>>>>>>>>> think of
> > > >>>>>>>>>>>>>> is the checksum mentioned above.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > > >>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > > >>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> $ parquet-tools cat
> > > >>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > >>>>>>>>>>>>>> id = 1
> > > >>>>>>>>>>>>>> data = a
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> $ parquet-tools cat
> > > >>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > >>>>>>>>>>>>>> id = 1
> > > >>>>>>>>>>>>>> data = a
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> A binary diff here:
> > > >>>>>>>>>>>>>>
> > > >> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Cheers, Fokko
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
> > > >>>>>>>>>>>>> chenjunjiedada@gmail.com
> > > >>>>>>>>>>>>>>> :
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> +1
> > > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install
> > > >> successfully.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> 于2019年11月14日周四
> > > >>>>>>>>> 下午2:05写道:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> +1
> > > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> > > >>>>>>>>>>>>> "sql/test-only"
> > > >>>>>>>>>>>>>>> -Phadoop-3.2
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <
> > > >> gabor@apache.org>
> > > >>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi everyone,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> I propose the following RC to be released as official
> > > >>>>>>>>>> Apache
> > > >>>>>>>>>>>>>> Parquet
> > > >>>>>>>>>>>>>>> 1.11.0
> > > >>>>>>>>>>>>>>>> release.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > >>>>>>>>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > >>>>>>>>>>>>>>>> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> The release tarball, signature, and checksums are here:
> > > >>>>>>>>>>>>>>>> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> You can find the KEYS file here:
> > > >>>>>>>>>>>>>>>> * https://apache.org/dist/parquet/KEYS
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
> > > >>>>>>>>>>>>>>>> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> This release includes the changes listed at:
> > > >>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Please download, verify, and test.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Please vote in the next 72 hours.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
> > > >>>>>>>>>>>>>>>> [ ] +0
> > > >>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> --
> > > >>>>>>>>>>>> Ryan Blue
> > > >>>>>>>>>>>> Software Engineer
> > > >>>>>>>>>>>> Netflix
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>> Ryan Blue
> > > >>>>>>>>>> Software Engineer
> > > >>>>>>>>>> Netflix
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Ryan Blue
> > > >>>> Software Engineer
> > > >>>> Netflix
> > > >>>
> > > >>>
> > > >>
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > >
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Gabor,

1.7.0 was the first version using the org.apache.parquet packages, so
that's the correct base version for compatibility checks. The exclusions in
the POM are classes that the Parquet community does not consider public. We
rely on these checks to highlight binary incompatibilities, and then we
discuss them on this list or in the dev sync. If the class is internal, we
add an exclusion for it.

I know you're familiar with this process since we've talked about it
before. I also know that you'd rather have more strict binary
compatibility, but until we have someone with the time to do some
maintenance and build a public API module, I'm afraid that's what we have
to work with.

Michael,

I hope the context above is helpful and explains why running a binary
compatibility check tool will find incompatible changes. We allow binary
incompatible changes to internal classes and modules, like parquet-common.
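
For a concrete illustration of the mixed-classpath failure, here is a
minimal sketch (class name and schema are made up) that compiles against
parquet-avro 1.11.0 and then fails with the NoSuchMethodError quoted in
this thread whenever an older parquet-column ends up first on the runtime
classpath:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class MixedClasspathRepro {
  public static void main(String[] args) {
    Schema avro = SchemaBuilder.record("record").fields()
        .requiredLong("id")
        .requiredString("data")
        .endRecord();
    // parquet-avro 1.11.0 internally calls
    // Types$PrimitiveBuilder.as(LogicalTypeAnnotation); with
    // parquet-column <= 1.10.x on the classpath this line throws
    // java.lang.NoSuchMethodError.
    MessageType parquet = new AvroSchemaConverter().convert(avro);
    System.out.println(parquet);
  }
}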

On Fri, Nov 22, 2019 at 12:23 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Ryan,
> I would not trust our compatibility checks (semver) too much. Currently, it
> is configured to compare our current version to 1.7.0. It means anything
> that is added since 1.7.0 and then broke in a later release won't be
> caught. In addition, many packages are excluded from the check because of
> different reasons. For example org/apache/parquet/schema/** is excluded so
> if it would really be an API compatibility issue we certainly wouldn't
> catch it.
>
> Michael,
> It fails because of a NoSuchMethodError pointing to a method that is newly
> introduced in 1.11. Both the caller and the callee shipped by parquet-mr.
> So, I'm quite sure it is a classpath issue. It seems that the 1.11 version
> of the parquet-column jar is not on the classpath.
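
One quick way to confirm that from inside the Spark job is to print which
jar the schema classes are actually loaded from; a minimal sketch (class
name is made up):

import org.apache.parquet.schema.Types;

public class WhichJar {
  public static void main(String[] args) {
    // Prints the location of the jar that provides org.apache.parquet.schema
    // at runtime; run the same line inside an executor to check its classpath.
    System.out.println(
        Types.class.getProtectionDomain().getCodeSource().getLocation());
  }
}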
>
>
> On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <he...@gmail.com> wrote:
>
> > The dependency versions are consistent in our artifact
> >
> > $ mvn dependency:tree | grep parquet
> > [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> > [INFO] |     \-
> > org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> > [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> > [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> > [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> > [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> > [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
> >
> > The latter error
> >
> > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> >         at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> >
> > occurs when I attempt to run via spark-submit on Spark 2.4.4
> >
> > $ spark-submit --version
> > Welcome to
> >       ____              __
> >      / __/__  ___ _____/ /__
> >     _\ \/ _ \/ _ `/ __/  '_/
> >    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
> >       /_/
> >
> > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
> > Branch
> > Compiled by user  on 2019-08-27T21:21:38Z
> > Revision
> > Url
> > Type --help for more information.
> >
> >
> >
> > > On Nov 21, 2019, at 6:06 PM, Ryan Blue <rb...@netflix.com.INVALID>
> > wrote:
> > >
> > > Thanks for looking into it, Nandor. That doesn't sound like a problem
> > with
> > > Parquet, but a problem with the test environment since parquet-avro
> > depends
> > > on a newer API method.
> > >
> > > On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar
> > <nk...@cloudera.com.invalid>
> > > wrote:
> > >
> > >> I'm not sure that this is a binary compatibility issue. The missing
> > builder
> > >> method was recently added in 1.11.0 with the introduction of the new
> > >> logical type API, while the original version (one with a single
> > >> OriginalType input parameter called before by AvroSchemaConverter) of
> > this
> > >> method is kept untouched. It seems to me that the Parquet version on
> > Spark
> > >> executor mismatch: parquet-avro is on 1.11.0, but parquet-column is
> > still
> > >> on an older version.
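
To see both variants side by side, a minimal reflective sketch (class name
is made up) lists the as(...) overloads that the parquet-column actually
loaded at runtime provides:

import java.lang.reflect.Method;

public class ListAsOverloads {
  public static void main(String[] args) throws Exception {
    // On parquet-column 1.11.0 this prints both as(OriginalType) and
    // as(LogicalTypeAnnotation); on 1.10.x only the former appears.
    for (Method m : Class.forName("org.apache.parquet.schema.Types$PrimitiveBuilder")
        .getMethods()) {
      if (m.getName().equals("as")) {
        System.out.println(m);
      }
    }
  }
}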
> > >>
> > >> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <he...@gmail.com>
> > wrote:
> > >>
> > >>> Perhaps not strictly necessary to say, but if this particular
> > >>> compatibility break between 1.10 and 1.11 was intentional, and no
> other
> > >>> compatibility breaks are found, I would vote -1 (non-binding) on this
> > RC
> > >>> such that we might go back and revisit the changes to preserve
> > >>> compatibility.
> > >>>
> > >>> I am not sure there is presently enough motivation in the Spark
> project
> > >>> for a release after 2.4.4 and before 3.0 in which to bump the Parquet
> > >>> dependency version to 1.11.x.
> > >>>
> > >>>   michael
> > >>>
> > >>>
> > >>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rb...@netflix.com.INVALID>
> > >>> wrote:
> > >>>>
> > >>>> Gabor, shouldn't Parquet be binary compatible for public APIs? From
> > the
> > >>>> stack trace, it looks like this 1.11.0 RC breaks binary
> compatibility
> > >> in
> > >>>> the type builders.
> > >>>>
> > >>>> Looks like this should have been caught by the binary compatibility
> > >>> checks.
> > >>>>
> > >>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <ga...@apache.org>
> > >>> wrote:
> > >>>>
> > >>>>> Hi Michael,
> > >>>>>
> > >>>>> Unfortunately, I don't have too much experience on Spark. But if
> > spark
> > >>> uses
> > >>>>> the parquet-mr library in an embedded way (that's how Hive uses it)
> > it
> > >>> is
> > >>>>> required to re-build Spark with 1.11 RC parquet-mr.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Gabor
> > >>>>>
> > >>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <he...@gmail.com>
> > >>> wrote:
> > >>>>>
> > >>>>>> It appears a provided scope dependency on spark-sql leaks old
> > parquet
> > >>>>>> versions was causing the runtime error below.  After including new
> > >>>>>> parquet-column and parquet-hadoop compile scope dependencies (in
> > >>> addition
> > >>>>>> to parquet-avro, which we already have at compile scope), our
> build
> > >>>>>> succeeds.
> > >>>>>>
> > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232 <
> > >>>>>> https://github.com/bigdatagenomics/adam/pull/2232>
> > >>>>>>
> > >>>>>> However, when running via spark-submit I run into a similar
> runtime
> > >>> error
> > >>>>>>
> > >>>>>> Caused by: java.lang.NoSuchMethodError:
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> > >>>>>>       at
> > >>>>>>
> > >>>
> > org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > >>>>>>       at org.apache.spark.internal.io
> > >>>>>>
> > >>
> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > >>>>>>       at org.apache.spark.internal.io
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > >>>>>>       at org.apache.spark.internal.io
> > >>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > >>>>>>       at org.apache.spark.internal.io
> > >>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > >>>>>>       at
> > >>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > >>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > >>>>>>       at
> > >>>>>> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > >>>>>>       at
> > >>>>>>
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >>>>>>       at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >>>>>>       at java.lang.Thread.run(Thread.java:748)
> > >>>>>>
> > >>>>>>
> > >>>>>> Will bumping our library dependency version to 1.11 require a new
> > >>> version
> > >>>>>> of Spark, built against Parquet 1.11?
> > >>>>>>
> > >>>>>> Please accept my apologies if this is heading out-of-scope for the
> > >>>>> Parquet
> > >>>>>> mailing list.
> > >>>>>>
> > >>>>>>  michael
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <he...@GMAIL.COM>
> > >>> wrote:
> > >>>>>>>
> > >>>>>>> I am willing to do some benchmarking on genomic data at scale but
> > am
> > >>>>> not
> > >>>>>> quite sure what the Spark target version for 1.11.0 might be.
> Will
> > >>>>> Parquet
> > >>>>>> 1.11.0 be compatible in Spark 2.4.x?
> > >>>>>>>
> > >>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > >>>>>>>
> > >>>>>>> …
> > >>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
> > >>>>>> org/apache/parquet/schema/LogicalTypeAnnotation
> > >>>>>>>     at
> > >>>>>>
> > >>>
> > org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > >>>>>>>     at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > >>>>>>>     at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > >>>>>>>     at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > >>>>>>>     at org.apache.spark.internal.io
> > >>>>>>
> > >>
> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > >>>>>>>     at org.apache.spark.internal.io
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > >>>>>>>     at org.apache.spark.internal.io
> > >>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > >>>>>>>     at org.apache.spark.internal.io
> > >>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > >>>>>>>     at
> > >>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > >>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > >>>>>>>     at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > >>>>>>>     at
> > >>>>>> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > >>>>>>>     at
> > >>>>>>
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > >>>>>>>     at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >>>>>>>     at
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >>>>>>>     at java.lang.Thread.run(Thread.java:748)
> > >>>>>>> Caused by: java.lang.ClassNotFoundException:
> > >>>>>> org.apache.parquet.schema.LogicalTypeAnnotation
> > >>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > >>>>>>>     at
> > >> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > >>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > >>>>>>>
> > >>>>>>> michael
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <gabor@apache.org
> >
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>> Thanks, Fokko.
> > >>>>>>>>
> > >>>>>>>> Ryan, we did not do such measurements yet. I'm afraid, I won't
> > have
> > >>>>>> enough
> > >>>>>>>> time to do that in the next couple of weeks.
> > >>>>>>>>
> > >>>>>>>> Cheers,
> > >>>>>>>> Gabor
> > >>>>>>>>
> > >>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko
> > >>>>> <fokko@driesprong.frl
> > >>>>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks Gabor for the explanation. I'd like to change my vote to
> > +1
> > >>>>>>>>> (non-binding).
> > >>>>>>>>>
> > >>>>>>>>> Cheers, Fokko
> > >>>>>>>>>
> > >>>>>>>>> Op di 19 nov. 2019 om 18:03 schreef Ryan Blue
> > >>>>>> <rb...@netflix.com.invalid>
> > >>>>>>>>>
> > >>>>>>>>>> Gabor, what I meant was: have we tried this with real data to
> > see
> > >>>>> the
> > >>>>>>>>>> effect? I think those results would be helpful.
> > >>>>>>>>>>
> > >>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <
> > >>> gabor@apache.org
> > >>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Ryan,
> > >>>>>>>>>>>
> > >>>>>>>>>>> It is not easy to calculate. For the column indexes feature
> we
> > >>>>>>>>> introduced
> > >>>>>>>>>>> two new structures saved before the footer: column indexes
> and
> > >>>>> offset
> > >>>>>>>>>>> indexes. If the min/max values are not too long, then the
> > >>>>> truncation
> > >>>>>>>>>> might
> > >>>>>>>>>>> not decrease the file size because of the offset indexes.
> > >>> Moreover,
> > >>>>>> we
> > >>>>>>>>>> also
> > >>>>>>>>>>> introduced parquet.page.row.count.limit which might increase
> > the
> > >>>>>> number
> > >>>>>>>>>> of
> > >>>>>>>>>>> pages which leads to increasing the file size.
> > >>>>>>>>>>> The footer itself is also changed and we are saving more
> values
> > >> in
> > >>>>>> it:
> > >>>>>>>>>> the
> > >>>>>>>>>>> offset values to the column/offset indexes, the new logical
> > type
> > >>>>>>>>>>> structures, the CRC checksums (we might have some others).
> > >>>>>>>>>>> So, the size of the files with small amount of data will be
> > >>>>> increased
> > >>>>>>>>>>> (because of the larger footer). The size of the files where
> the
> > >>>>>> values
> > >>>>>>>>>> can
> > >>>>>>>>>>> be encoded very well (RLE) will probably be increased
> (because
> > >> we
> > >>>>>> will
> > >>>>>>>>>> have
> > >>>>>>>>>>> more pages). The size of some files where the values are long
> > >>>>>> (>64bytes
> > >>>>>>>>>> by
> > >>>>>>>>>>> default) might be decreased because of truncating the min/max
> > >>>>> values.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>> Gabor
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue
> > >>>>> <rblue@netflix.com.invalid
> > >>>>>>>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Gabor, do we have an idea of the additional overhead for a
> > >>>>> non-test
> > >>>>>>>>>> data
> > >>>>>>>>>>>> file? It should be easy to validate that this doesn't
> > introduce
> > >>> an
> > >>>>>>>>>>>> unreasonable amount of overhead. In some cases, it should
> > >>> actually
> > >>>>>> be
> > >>>>>>>>>>>> smaller since the column indexes are truncated and page
> stats
> > >> are
> > >>>>>>>>> not.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > >>>>>>>>>>>> <ga...@cloudera.com.invalid> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Fokko,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> For the first point. The referenced constructor is private
> > and
> > >>>>>>>>>> Iceberg
> > >>>>>>>>>>>> uses
> > >>>>>>>>>>>>> it via reflection. It is not a breaking change. I think,
> > >>>>> parquet-mr
> > >>>>>>>>>>> shall
> > >>>>>>>>>>>>> not keep private methods only because of clients might use
> > >> them
> > >>>>> via
> > >>>>>>>>>>>>> reflection.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> About the checksum. I've agreed on having the CRC checksum
> > >> write
> > >>>>>>>>>>> enabled
> > >>>>>>>>>>>> by
> > >>>>>>>>>>>>> default because the benchmarks did not show significant
> > >>>>> performance
> > >>>>>>>>>>>>> penalties. See
> https://github.com/apache/parquet-mr/pull/647
> > >>> for
> > >>>>>>>>>>>> details.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> About the file size change. 1.11.0 is introducing column
> > >>> indexes,
> > >>>>>>>>> CRC
> > >>>>>>>>>>>>> checksum, removing the statistics from the page headers and
> > >>> maybe
> > >>>>>>>>>> other
> > >>>>>>>>>>>>> changes that impact file size. If only file size is in
> > >> question
> > >>> I
> > >>>>>>>>>>> cannot
> > >>>>>>>>>>>>> see a breaking change here.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>> Gabor
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> > >>>>>>>>>> <fokko@driesprong.frl
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three
> > >> things:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - We've broken backward compatibility of the constructor
> of
> > >>>>>>>>>>>>>> ColumnChunkPageWriteStore
> > >>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > >>>>>>>>>>>>>>> .
> > >>>>>>>>>>>>>> This required a change
> > >>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will
> be
> > >> a
> > >>>>>>>>>> new
> > >>>>>>>>>>>> RC,
> > >>>>>>>>>>>>>> I've
> > >>>>>>>>>>>>>> submitted a patch:
> > >>>>>>>>>> https://github.com/apache/parquet-mr/pull/699
> > >>>>>>>>>>>>>> - Related, that we need to put in the changelog, is that
> > >>>>>>>>>> checksums
> > >>>>>>>>>>>> are
> > >>>>>>>>>>>>>> enabled by default:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > >>>>>>>>>>>>>> This
> > >>>>>>>>>>>>>> will impact performance. I would suggest disabling it by
> > >>>>>>>>>> default:
> > >>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
> > >>>>>>>>>>>>>> <
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've
> noticed
> > >>>>>>>>>> that
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>>> split-test was failing:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > >>>>>>>>>>>>>> The
> > >>>>>>>>>>>>>> two records are now divided over four Spark partitions.
> > >>>>>>>>>> Something
> > >>>>>>>>>>> in
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>> output has changed since the files are bigger now. Has
> > anyone
> > >>>>>>>>>> any
> > >>>>>>>>>>>> idea
> > >>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>> check what's changed, or a way to check this? The only
> thing
> > >> I
> > >>>>>>>>>> can
> > >>>>>>>>>>>>>> think of
> > >>>>>>>>>>>>>> is the checksum mentioned above.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > >>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > >>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > >>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> $ parquet-tools cat
> > >>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > >>>>>>>>>>>>>> id = 1
> > >>>>>>>>>>>>>> data = a
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> $ parquet-tools cat
> > >>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > >>>>>>>>>>>>>> id = 1
> > >>>>>>>>>>>>>> data = a
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> A binary diff here:
> > >>>>>>>>>>>>>>
> > >> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Cheers, Fokko
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
> > >>>>>>>>>>>>> chenjunjiedada@gmail.com
> > >>>>>>>>>>>>>>> :
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> +1
> > >>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install
> > >> successfully.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> 于2019年11月14日周四
> > >>>>>>>>> 下午2:05写道:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> +1
> > >>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> > >>>>>>>>>>>>> "sql/test-only"
> > >>>>>>>>>>>>>>> -Phadoop-3.2
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <
> > >> gabor@apache.org>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi everyone,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I propose the following RC to be released as official
> > >>>>>>>>>> Apache
> > >>>>>>>>>>>>>> Parquet
> > >>>>>>>>>>>>>>> 1.11.0
> > >>>>>>>>>>>>>>>> release.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The commit id is
> 18519eb8e059865652eee3ff0e8593f126701da4
> > >>>>>>>>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > >>>>>>>>>>>>>>>> *
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&amp;reserved=0
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The release tarball, signature, and checksums are here:
> > >>>>>>>>>>>>>>>> *
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&amp;reserved=0
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> You can find the KEYS file here:
> > >>>>>>>>>>>>>>>> *
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&amp;reserved=0
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
> > >>>>>>>>>>>>>>>> *
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> This release includes the changes listed at:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Please download, verify, and test.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Please vote in the next 72 hours.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
> > >>>>>>>>>>>>>>>> [ ] +0
> > >>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> Ryan Blue
> > >>>>>>>>>>>> Software Engineer
> > >>>>>>>>>>>> Netflix
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> Ryan Blue
> > >>>>>>>>>> Software Engineer
> > >>>>>>>>>> Netflix
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Ryan Blue
> > >>>> Software Engineer
> > >>>> Netflix
> > >>>
> > >>>
> > >>
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Ryan,
I would not trust our compatibility checks (semver) too much. Currently,
they are configured to compare our current version against 1.7.0, which
means that anything added after 1.7.0 and then broken in a later release
won't be caught. In addition, many packages are excluded from the check for
various reasons. For example, org/apache/parquet/schema/** is excluded, so
even if this really were an API compatibility issue, we certainly wouldn't
catch it.
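
For reference, this kind of break can also be surfaced locally with the
japicmp command-line tool by comparing the two released jars. A rough
sketch only -- this is not how our build invokes the check, and the flag
spellings are assumptions to verify against japicmp's own help output:

$ java -jar japicmp-jar-with-dependencies.jar \
    -o parquet-column-1.10.1.jar \
    -n parquet-column-1.11.0.jar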

Michael,
It fails with a NoSuchMethodError pointing to a method that was newly
introduced in 1.11, and both the caller and the callee are shipped by
parquet-mr. So I'm quite sure this is a classpath issue: it seems the 1.11
version of the parquet-column jar is not on the classpath.
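
A quick way to check which overloads a given jar actually ships is javap
from the JDK (the jar filename below is only an example path; the `as`
overloads are declared on Types$Builder):

$ javap -cp parquet-column-1.11.0.jar 'org.apache.parquet.schema.Types$Builder' \
    | grep ' as('

On a 1.11.0 jar this should list both the OriginalType and the
LogicalTypeAnnotation overloads of as(); on a 1.10.x jar only the former.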


On Fri, Nov 22, 2019 at 1:44 AM Michael Heuer <he...@gmail.com> wrote:

> The dependency versions are consistent in our artifact
>
> $ mvn dependency:tree | grep parquet
> [INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
> [INFO] |     \-
> org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
> [INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
> [INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
> [INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
> [INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
> [INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile
>
> The latter error
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task
> 0.0 in stage 0.0 (TID 0, localhost, executor driver):
> java.lang.NoSuchMethodError:
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>         at
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>
> occurs when I attempt to run via spark-submit on Spark 2.4.4
>
> $ spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
>
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
> Branch
> Compiled by user  on 2019-08-27T21:21:38Z
> Revision
> Url
> Type --help for more information.

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Michael Heuer <he...@gmail.com>.
The dependency versions are consistent in our artifact

$ mvn dependency:tree | grep parquet
[INFO] |  \- org.apache.parquet:parquet-avro:jar:1.11.0:compile
[INFO] |     \- org.apache.parquet:parquet-format-structures:jar:1.11.0:compile
[INFO] |  +- org.apache.parquet:parquet-column:jar:1.11.0:compile
[INFO] |  |  +- org.apache.parquet:parquet-common:jar:1.11.0:compile
[INFO] |  |  \- org.apache.parquet:parquet-encoding:jar:1.11.0:compile
[INFO] |  +- org.apache.parquet:parquet-hadoop:jar:1.11.0:compile
[INFO] |  |  +- org.apache.parquet:parquet-jackson:jar:1.11.0:compile

The latter error

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)

occurs when I attempt to run via spark-submit on Spark 2.4.4

$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_191
Branch
Compiled by user  on 2019-08-27T21:21:38Z
Revision
Url
Type --help for more information.



> On Nov 21, 2019, at 6:06 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Thanks for looking into it, Nandor. That doesn't sound like a problem with
> Parquet, but a problem with the test environment since parquet-avro depends
> on a newer API method.
> 
> On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <nk...@cloudera.com.invalid>
> wrote:
> 
>> I'm not sure that this is a binary compatibility issue. The missing builder
>> method was recently added in 1.11.0 with the introduction of the new
>> logical type API, while the original version of this method (the one with a
>> single OriginalType parameter that AvroSchemaConverter called before) is
>> kept untouched. It seems to me that the Parquet versions on the Spark
>> executor mismatch: parquet-avro is on 1.11.0, but parquet-column is still
>> on an older version.
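
To make that mismatch concrete: both overloads below exist in parquet-column
1.11.0, and only the second one is new. A minimal sketch compiled against
1.11.0 (class name and field names are illustrative):

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class BuilderOverloads {
        public static void main(String[] args) {
            // Pre-1.11 overload, kept untouched: as(OriginalType)
            MessageType oldStyle = Types.buildMessage()
                .optional(PrimitiveTypeName.BINARY).as(OriginalType.UTF8).named("s")
                .named("schema");

            // Overload added in 1.11: as(LogicalTypeAnnotation) -- the method
            // the NoSuchMethodError above complains about.
            MessageType newStyle = Types.buildMessage()
                .optional(PrimitiveTypeName.BINARY)
                .as(LogicalTypeAnnotation.stringType()).named("s")
                .named("schema");

            System.out.println(oldStyle);
            System.out.println(newStyle);
        }
    }

Run with parquet-avro 1.11.0 but an older parquet-column on the classpath,
the second call fails at runtime exactly like the stack traces in this
thread.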
>> 
>> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <he...@gmail.com> wrote:
>> 
>>> Perhaps not strictly necessary to say, but if this particular
>>> compatibility break between 1.10 and 1.11 was intentional, and no other
>>> compatibility breaks are found, I would vote -1 (non-binding) on this RC
>>> such that we might go back and revisit the changes to preserve
>>> compatibility.
>>> 
>>> I am not sure there is presently enough motivation in the Spark project
>>> for a release after 2.4.4 and before 3.0 in which to bump the Parquet
>>> dependency version to 1.11.x.
>>> 
>>>   michael
>>> 
>>> 
>>>> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rb...@netflix.com.INVALID>
>>> wrote:
>>>> 
>>>> Gabor, shouldn't Parquet be binary compatible for public APIs? From the
>>>> stack trace, it looks like this 1.11.0 RC breaks binary compatibility
>> in
>>>> the type builders.
>>>> 
>>>> Looks like this should have been caught by the binary compatibility
>>> checks.
>>>> 
>>>> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <ga...@apache.org>
>>> wrote:
>>>> 
>>>>> Hi Michael,
>>>>> 
>>>>> Unfortunately, I don't have too much experience on Spark. But if spark
>>> uses
>>>>> the parquet-mr library in an embedded way (that's how Hive uses it) it
>>> is
>>>>> required to re-build Spark with 1.11 RC parquet-mr.
>>>>> 
>>>>> Regards,
>>>>> Gabor
>>>>> 
>>>>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <he...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> It appears a provided scope dependency on spark-sql leaks old parquet
>>>>>> versions was causing the runtime error below.  After including new
>>>>>> parquet-column and parquet-hadoop compile scope dependencies (in
>>> addition
>>>>>> to parquet-avro, which we already have at compile scope), our build
>>>>>> succeeds.
>>>>>> 
>>>>>> https://github.com/bigdatagenomics/adam/pull/2232 <
>>>>>> https://github.com/bigdatagenomics/adam/pull/2232>
>>>>>> 
>>>>>> However, when running via spark-submit I run into a similar runtime
>>> error
>>>>>> 
>>>>>> Caused by: java.lang.NoSuchMethodError:
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>>>>>>       at
>>>>>> 
>>> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>>>       at org.apache.spark.internal.io
>>>>>> 
>> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>>>       at org.apache.spark.internal.io
>>>>>> 
>>>>> 
>>> 
>> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>>>       at org.apache.spark.internal.io
>>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>>>       at org.apache.spark.internal.io
>>>>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>>>       at
>>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>>>       at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>>>       at
>>>>>> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>>>       at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>       at
>>>>>> 
>>>>> 
>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>       at java.lang.Thread.run(Thread.java:748)
>>>>>> 
>>>>>> 
>>>>>> Will bumping our library dependency version to 1.11 require a new
>>> version
>>>>>> of Spark, built against Parquet 1.11?
>>>>>> 
>>>>>> Please accept my apologies if this is heading out-of-scope for the
>>>>> Parquet
>>>>>> mailing list.
>>>>>> 
>>>>>>  michael
>>>>>> 
>>>>>> 
>>>>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <he...@GMAIL.COM>
>>> wrote:
>>>>>>> 
>>>>>>> I am willing to do some benchmarking on genomic data at scale but am
>>>>> not
>>>>>> quite sure what the Spark target version for 1.11.0 might be.  Will
>>>>> Parquet
>>>>>> 1.11.0 be compatible in Spark 2.4.x?
>>>>>>> 
>>>>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
>>>>>>> 
>>>>>>> …
>>>>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
>>>>>>>     at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>>>>     at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>>>>     at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>>>>     at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>>>>     at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>>>>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
>>>>>>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>>>>>>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>> 
>>>>>>> michael
>>>>>>> 
>>>>>>> 
>>>>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <ga...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> Thanks, Fokko.
>>>>>>>> 
>>>>>>>> Ryan, we did not do such measurements yet. I'm afraid, I won't have enough
>>>>>>>> time to do that in the next couple of weeks.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Gabor
>>>>>>>> 
>>>>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
>>>>>>>> 
>>>>>>>>> Thanks Gabor for the explanation. I'd like to change my vote to +1
>>>>>>>>> (non-binding).
>>>>>>>>> 
>>>>>>>>> Cheers, Fokko
>>>>>>>>> 
>>>>>>>>> On Tue, 19 Nov 2019 at 18:03, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>> 
>>>>>>>>>> Gabor, what I meant was: have we tried this with real data to see the
>>>>>>>>>> effect? I think those results would be helpful.
>>>>>>>>>> 
>>>>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <gabor@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>> 
>>>>>>>>>>> It is not easy to calculate. For the column indexes feature we introduced
>>>>>>>>>>> two new structures saved before the footer: column indexes and offset
>>>>>>>>>>> indexes. If the min/max values are not too long, then the truncation might
>>>>>>>>>>> not decrease the file size because of the offset indexes. Moreover, we also
>>>>>>>>>>> introduced parquet.page.row.count.limit which might increase the number of
>>>>>>>>>>> pages which leads to increasing the file size.
>>>>>>>>>>> The footer itself is also changed and we are saving more values in it: the
>>>>>>>>>>> offset values to the column/offset indexes, the new logical type
>>>>>>>>>>> structures, the CRC checksums (we might have some others).
>>>>>>>>>>> So, the size of the files with small amount of data will be increased
>>>>>>>>>>> (because of the larger footer). The size of the files where the values can
>>>>>>>>>>> be encoded very well (RLE) will probably be increased (because we will have
>>>>>>>>>>> more pages). The size of some files where the values are long (>64bytes by
>>>>>>>>>>> default) might be decreased because of truncating the min/max values.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Gabor
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rblue@netflix.com.invalid> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Gabor, do we have an idea of the additional overhead for a non-test data
>>>>>>>>>>>> file? It should be easy to validate that this doesn't introduce an
>>>>>>>>>>>> unreasonable amount of overhead. In some cases, it should actually be
>>>>>>>>>>>> smaller since the column indexes are truncated and page stats are not.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>>>>>>>>>>>> <ga...@cloudera.com.invalid> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Fokko,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For the first point. The referenced constructor is private and Iceberg
>>>>>>>>>>>>> uses it via reflection. It is not a breaking change. I think, parquet-mr
>>>>>>>>>>>>> shall not keep private methods only because clients might use them via
>>>>>>>>>>>>> reflection.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> About the checksum. I've agreed on having the CRC checksum write enabled
>>>>>>>>>>>>> by default because the benchmarks did not show significant performance
>>>>>>>>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> About the file size change. 1.11.0 is introducing column indexes, CRC
>>>>>>>>>>>>> checksum, removing the statistics from the page headers and maybe other
>>>>>>>>>>>>> changes that impact file size. If only file size is in question I cannot
>>>>>>>>>>>>> see a breaking change here.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Gabor
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fokko@driesprong.frl> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> - We've broken backward compatibility of the constructor of
>>>>>>>>>>>>>> ColumnChunkPageWriteStore
>>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
>>>>>>>>>>>>>> This required a change
>>>>>>>>>>>>>> <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
>>>>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will be a new RC,
>>>>>>>>>>>>>> I've submitted a patch: https://github.com/apache/parquet-mr/pull/699
>>>>>>>>>>>>>> - Related, that we need to put in the changelog, is that checksums are
>>>>>>>>>>>>>> enabled by default:
>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
>>>>>>>>>>>>>> This will impact performance. I would suggest disabling it by default:
>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
>>>>>>>>>>>>>> <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
>>>>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed that the
>>>>>>>>>>>>>> split-test was failing:
>>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>>>>>>>>>>>>>> The two records are now divided over four Spark partitions. Something in
>>>>>>>>>>>>>> the output has changed since the files are bigger now. Has anyone any idea
>>>>>>>>>>>>>> to check what's changed, or a way to check this? The only thing I can
>>>>>>>>>>>>>> think of is the checksum mentioned above.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
>>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
>>>>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>>>>> id = 1
>>>>>>>>>>>>>> data = a
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> $ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>>>>> id = 1
>>>>>>>>>>>>>> data = a
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> A binary diff here:
>>>>>>>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, 16 Nov 2019 at 04:18, Junjie Chen <chenjunjiedada@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> wrote on Thu, Nov 14, 2019 at 2:05 PM:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <gabor@apache.org> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I propose the following RC to be released as official Apache
>>>>>>>>>>>>>>>> Parquet 1.11.0 release.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>>>> * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>>>> * https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The release tarball, signature, and checksums are here:
>>>>>>>>>>>>>>>> * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> You can find the KEYS file here:
>>>>>>>>>>>>>>>> * https://apache.org/dist/parquet/KEYS
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Binary artifacts are staged in Nexus here:
>>>>>>>>>>>>>>>> * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This release includes the changes listed at:
>>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Please download, verify, and test.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Please vote in the next 72 hours.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>>>>> [ ] -1 Do not release this because...
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>> Netflix
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>> 
>>> 
>> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
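
For anyone who wants to reproduce the file size measurements discussed above,
the relevant write-side knobs can be set through the Hadoop configuration that
parquet-mr reads. A minimal sketch, not from the release itself:
parquet.page.row.count.limit is named explicitly in the thread, while the other
two property names are my assumption of the 1.11 keys and worth double-checking
against the release:

    import org.apache.hadoop.conf.Configuration;

    public class ParquetWriteKnobs {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // CRC page checksums, written by default in 1.11 (the behavior
            // https://github.com/apache/parquet-mr/pull/700 proposed to flip):
            conf.setBoolean("parquet.page.write-checksum.enabled", false);
            // Upper bound on rows per page, introduced with column indexes:
            conf.setInt("parquet.page.row.count.limit", 20000);
            // Truncation length for column index min/max values (values longer
            // than 64 bytes are truncated by default, per the thread above):
            conf.setInt("parquet.columnindex.truncate.length", 64);
            // Pass conf to the ParquetWriter/ParquetOutputFormat used by the job.
            System.out.println(conf.get("parquet.page.row.count.limit"));
        }
    }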


Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for looking into it, Nandor. That doesn't sound like a problem with
Parquet, but a problem with the test environment since parquet-avro depends
on a newer API method.
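
A quick way to confirm that from the affected environment is to check which jar
provides parquet-column and whether the 1.11 logical type API is visible. This
is a minimal diagnostic sketch, not part of parquet-mr (the class name
ParquetClasspathCheck is mine); it only references classes named in the stack
traces in this thread:

    public class ParquetClasspathCheck {
        public static void main(String[] args) {
            // Which jar does parquet-column's Types class come from?
            System.out.println(org.apache.parquet.schema.Types.class
                .getProtectionDomain().getCodeSource().getLocation());
            try {
                // This class only exists in parquet-column 1.11+.
                Class.forName("org.apache.parquet.schema.LogicalTypeAnnotation");
                System.out.println("LogicalTypeAnnotation found: parquet-column is 1.11+");
            } catch (ClassNotFoundException e) {
                System.out.println("LogicalTypeAnnotation missing: parquet-column is 1.10 or older");
            }
        }
    }

Run it with the same classpath as the failing executor and the printed location
should point at the stale jar.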

On Thu, Nov 21, 2019 at 3:58 PM Nandor Kollar <nk...@cloudera.com.invalid> wrote:

> I'm not sure that this is a binary compatibility issue. The missing builder
> method was recently added in 1.11.0 with the introduction of the new
> logical type API, while the original version (one with a single
> OriginalType input parameter called before by AvroSchemaConverter) of this
> method is kept untouched. It seems to me that there is a Parquet version
> mismatch on the Spark executor: parquet-avro is on 1.11.0, but parquet-column
> is still on an older version.
>
> On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <he...@gmail.com> wrote:
>
> > Perhaps not strictly necessary to say, but if this particular
> > compatibility break between 1.10 and 1.11 was intentional, and no other
> > compatibility breaks are found, I would vote -1 (non-binding) on this RC
> > such that we might go back and revisit the changes to preserve
> > compatibility.
> >
> > I am not sure there is presently enough motivation in the Spark project
> > for a release after 2.4.4 and before 3.0 in which to bump the Parquet
> > dependency version to 1.11.x.
> >
> >    michael
> >

-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Nandor Kollar <nk...@cloudera.com.INVALID>.
I'm not sure that this is a binary compatibility issue. The missing builder
method was recently added in 1.11.0 with the introduction of the new
logical type API, while the original version (one with a single
OriginalType input parameter called before by AvroSchemaConverter) of this
method is kept untouched. It seems to me that there is a Parquet version
mismatch on the Spark executor: parquet-avro is on 1.11.0, but parquet-column
is still on an older version.
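
To make the shape of the problem concrete, here is a minimal sketch of the two
overloads in play, based on the signature in the stack trace (the class name
BuilderOverloads is mine; it assumes parquet-column 1.11 at compile time):

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Type;
    import org.apache.parquet.schema.Types;

    public class BuilderOverloads {
        public static void main(String[] args) {
            // Pre-1.11 overload, kept untouched in 1.11:
            Type legacy = Types.required(PrimitiveTypeName.BINARY)
                .as(OriginalType.UTF8)
                .named("legacy_string");
            // Overload added in 1.11 and now called by parquet-avro's
            // AvroSchemaConverter; resolving it against a 1.10 parquet-column
            // jar at runtime is what throws the NoSuchMethodError:
            Type current = Types.required(PrimitiveTypeName.BINARY)
                .as(LogicalTypeAnnotation.stringType())
                .named("current_string");
            System.out.println(legacy);
            System.out.println(current);
        }
    }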

On Thu, Nov 21, 2019 at 11:41 PM Michael Heuer <he...@gmail.com> wrote:

> Perhaps not strictly necessary to say, but if this particular
> compatibility break between 1.10 and 1.11 was intentional, and no other
> compatibility breaks are found, I would vote -1 (non-binding) on this RC
> such that we might go back and revisit the changes to preserve
> compatibility.
>
> I am not sure there is presently enough motivation in the Spark project
> for a release after 2.4.4 and before 3.0 in which to bump the Parquet
> dependency version to 1.11.x.
>
>    michael
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Michael Heuer <he...@gmail.com>.
Perhaps not strictly necessary to say, but if this particular compatibility break between 1.10 and 1.11 was intentional, and no other compatibility breaks are found, I would vote -1 (non-binding) on this RC such that we might go back and revisit the changes to preserve compatibility.

I am not sure there is presently enough motivation in the Spark project for a release after 2.4.4 and before 3.0 in which to bump the Parquet dependency version to 1.11.x.

   michael


> On Nov 21, 2019, at 11:01 AM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Gabor, shouldn't Parquet be binary compatible for public APIs? From the
> stack trace, it looks like this 1.11.0 RC breaks binary compatibility in
> the type builders.
> 
> Looks like this should have been caught by the binary compatibility checks.
> 
> On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <ga...@apache.org> wrote:
> 
>> Hi Michael,
>> 
>> Unfortunately, I don't have too much experience on Spark. But if spark uses
>> the parquet-mr library in an embedded way (that's how Hive uses it) it is
>> required to re-build Spark with 1.11 RC parquet-mr.
>> 
>> Regards,
>> Gabor
>> 
>> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <he...@gmail.com> wrote:
>> 
>>> It appears a provided scope dependency on spark-sql leaking old parquet
>>> versions was causing the runtime error below. After including new
>>> parquet-column and parquet-hadoop compile scope dependencies (in addition
>>> to parquet-avro, which we already have at compile scope), our build
>>> succeeds.
>>> 
>>> https://github.com/bigdatagenomics/adam/pull/2232
>>> 
>>> However, when running via spark-submit I run into a similar runtime error
>>> 
>>> Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>>>        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>>>        at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>>>        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>>>        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>>>        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>>>        at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>>>        at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>>>        at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>        at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>        at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>        at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>        at java.lang.Thread.run(Thread.java:748)
>>> 
>>> 
>>> Will bumping our library dependency version to 1.11 require a new version
>>> of Spark, built against Parquet 1.11?
>>> 
>>> Please accept my apologies if this is heading out-of-scope for the
>> Parquet
>>> mailing list.
>>> 
>>>   michael
>>> 
>>> 
>>>> On Nov 20, 2019, at 10:00 AM, Michael Heuer <he...@GMAIL.COM> wrote:
>>>> 
>>>> I am willing to do some benchmarking on genomic data at scale but am
>> not
>>> quite sure what the Spark target version for 1.11.0 might be.  Will
>> Parquet
>>> 1.11.0 be compatible with Spark 2.4.x?
>>>> 
>>>> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
>>>> 
>>>> …
>>>> D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
>>> org/apache/parquet/schema/LogicalTypeAnnotation
>>>>      at
>>> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>>>>      at
>>> 
>> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>>>>      at
>>> 
>> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>>>>      at
>>> 
>> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>>>>      at org.apache.spark.internal.io
>>> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>>>>      at org.apache.spark.internal.io
>>> 
>> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>>>>      at org.apache.spark.internal.io
>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>>>>      at org.apache.spark.internal.io
>>> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>>>>      at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>      at org.apache.spark.scheduler.Task.run(Task.scala:123)
>>>>      at
>>> 
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>>>>      at
>>> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>>>>      at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>>>>      at
>>> 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>      at
>>> 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>      at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.parquet.schema.LogicalTypeAnnotation
>>>>      at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>>      at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>>>>      at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>> 
>>>>  michael
>>>> 
>>>> 
>>>>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <ga...@apache.org>
>> wrote:
>>>>> 
>>>>> Thanks, Fokko.
>>>>> 
>>>>> Ryan, we did not do such measurements yet. I'm afraid, I won't have
>>> enough
>>>>> time to do that in the next couple of weeks.
>>>>> 
>>>>> Cheers,
>>>>> Gabor
>>>>> 
>>>>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko
>> <fokko@driesprong.frl
>>>> 
>>>>> wrote:
>>>>> 
>>>>>> Thanks Gabor for the explanation. I'd like to change my vote to +1
>>>>>> (non-binding).
>>>>>> 
>>>>>> Cheers, Fokko
>>>>>> 
>>>>>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue
>>> <rb...@netflix.com.invalid> wrote:
>>>>>> 
>>>>>>> Gabor, what I meant was: have we tried this with real data to see
>> the
>>>>>>> effect? I think those results would be helpful.
>>>>>>> 
>>>>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <gabor@apache.org
>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Ryan,
>>>>>>>> 
>>>>>>>> It is not easy to calculate. For the column indexes feature we
>>>>>> introduced
>>>>>>>> two new structures saved before the footer: column indexes and
>> offset
>>>>>>>> indexes. If the min/max values are not too long, then the
>> truncation
>>>>>>> might
>>>>>>>> not decrease the file size because of the offset indexes. Moreover,
>>> we
>>>>>>> also
>>>>>>>> introduced parquet.page.row.count.limit which might increase the
>>> number
>>>>>>> of
>>>>>>>> pages which leads to increasing the file size.
>>>>>>>> The footer itself is also changed and we are saving more values in
>>> it:
>>>>>>> the
>>>>>>>> offset values to the column/offset indexes, the new logical type
>>>>>>>> structures, the CRC checksums (we might have some others).
>>>>>>>> So, the size of the files with small amount of data will be
>> increased
>>>>>>>> (because of the larger footer). The size of the files where the
>>> values
>>>>>>> can
>>>>>>>> be encoded very well (RLE) will probably be increased (because we
>>> will
>>>>>>> have
>>>>>>>> more pages). The size of some files where the values are long
>>> (>64bytes
>>>>>>> by
>>>>>>>> default) might be decreased because of truncating the min/max
>> values.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Gabor
>>>>>>>> 
>>>>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue
>> <rblue@netflix.com.invalid
>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Gabor, do we have an idea of the additional overhead for a
>> non-test
>>>>>>> data
>>>>>>>>> file? It should be easy to validate that this doesn't introduce an
>>>>>>>>> unreasonable amount of overhead. In some cases, it should actually
>>> be
>>>>>>>>> smaller since the column indexes are truncated and page stats are
>>>>>> not.
>>>>>>>>> 
>>>>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>>>>>>>>> <ga...@cloudera.com.invalid> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Fokko,
>>>>>>>>>> 
>>>>>>>>>> For the first point. The referenced constructor is private and
>>>>>>> Iceberg
>>>>>>>>> uses
>>>>>>>>>> it via reflection. It is not a breaking change. I think,
>> parquet-mr
>>>>>>>> shall
>>>>>>>>>> not keep private methods only because of clients might use them
>> via
>>>>>>>>>> reflection.
>>>>>>>>>> 
>>>>>>>>>> About the checksum. I've agreed on having the CRC checksum write
>>>>>>>> enabled
>>>>>>>>> by
>>>>>>>>>> default because the benchmarks did not show significant
>> performance
>>>>>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for
>>>>>>>>> details.
>>>>>>>>>> 
>>>>>>>>>> About the file size change. 1.11.0 is introducing column indexes,
>>>>>> CRC
>>>>>>>>>> checksum, removing the statistics from the page headers and maybe
>>>>>>> other
>>>>>>>>>> changes that impact file size. If only file size is in question I
>>>>>>>> cannot
>>>>>>>>>> see a breaking change here.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Gabor
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
>>>>>>> <fokko@driesprong.frl
>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Unfortunately, a -1 from my side (non-binding)
>>>>>>>>>>> 
>>>>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
>>>>>>>>>>> 
>>>>>>>>>>> - We've broken backward compatibility of the constructor of
>>>>>>>>>>> ColumnChunkPageWriteStore
>>>>>>>>>>> <
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
>>>>>>>>>>>> .
>>>>>>>>>>> This required a change
>>>>>>>>>>> <
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
>>>>>>>>>>>> 
>>>>>>>>>>> to the code. This isn't a hard blocker, but if there will be a
>>>>>>> new
>>>>>>>>> RC,
>>>>>>>>>>> I've
>>>>>>>>>>> submitted a patch:
>>>>>>> https://github.com/apache/parquet-mr/pull/699
>>>>>>>>>>> - Related, that we need to put in the changelog, is that
>>>>>>> checksums
>>>>>>>>> are
>>>>>>>>>>> enabled by default:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
>>>>>>>>>>> This
>>>>>>>>>>> will impact performance. I would suggest disabling it by
>>>>>>> default:
>>>>>>>>>>> https://github.com/apache/parquet-mr/pull/700
>>>>>>>>>>> <
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
>>>>>>>>>>>> 
>>>>>>>>>>> - Binary compatibility. While updating Iceberg, I've noticed
>>>>>>> that
>>>>>>>>> the
>>>>>>>>>>> split-test was failing:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>>>>>>>>>>> The
>>>>>>>>>>> two records are now divided over four Spark partitions.
>>>>>>> Something
>>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>>> output has changed since the files are bigger now. Has anyone
>>>>>>> any
>>>>>>>>> idea
>>>>>>>>>>> to
>>>>>>>>>>> check what's changed, or a way to check this? The only thing I
>>>>>>> can
>>>>>>>>>>> think of
>>>>>>>>>>> is the checksum mentioned above.
>>>>>>>>>>> 
>>>>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
>>>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>> 
>>>>>>>>>>> $ parquet-tools cat
>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>>>>> id = 1
>>>>>>>>>>> data = a
>>>>>>>>>>> 
>>>>>>>>>>> $ parquet-tools cat
>>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>>>>> id = 1
>>>>>>>>>>> data = a
>>>>>>>>>>> 
>>>>>>>>>>> A binary diff here:
>>>>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>>>>>>>>>>> 
>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <
>>>>>>>>>> chenjunjiedada@gmail.com
>>>>>>>>>>>> > wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> +1
>>>>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
>>>>>>>>>>>> 
>>>>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> wrote on Thursday,
>>>>>> Nov 14, 2019 at 2:05 PM:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> +1
>>>>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt
>>>>>>>>>> "sql/test-only"
>>>>>>>>>>>> -Phadoop-3.2
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  Hi everyone,
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  I propose the following RC to be released as official
>>>>>>> Apache
>>>>>>>>>>> Parquet
>>>>>>>>>>>> 1.11.0
>>>>>>>>>>>>>  release.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>>  * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>>>>>  *
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  The release tarball, signature, and checksums are here:
>>>>>>>>>>>>>  *
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  You can find the KEYS file here:
>>>>>>>>>>>>>  *
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://apache.org/dist/parquet/KEYS
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  Binary artifacts are staged in Nexus here:
>>>>>>>>>>>>>  *
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://repository.apache.org/content/groups/staging/org/apache/parquet/
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  This release includes the changes listed at:
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>> 
>> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  Please download, verify, and test.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  Please vote in the next 72 hours.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>  [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>>>>>  [ ] +0
>>>>>>>>>>>>>  [ ] -1 Do not release this because...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Gabor, shouldn't Parquet be binary compatible for public APIs? From the
stack trace, it looks like this 1.11.0 RC breaks binary compatibility in
the type builders.

Looks like this should have been caught by the binary compatibility checks.
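
For reference, here is a minimal sketch of the signature change (the schema
below is hypothetical; this only illustrates the linkage failure and is not
the parquet-avro code itself). parquet-avro 1.11 links against
Types$PrimitiveBuilder.as(LogicalTypeAnnotation), which does not exist in
1.10.x, so callers compiled against 1.11 but run with 1.10.x jars on the
classpath fail with exactly the NoSuchMethodError above:

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class BuilderCompatSketch {
    public static void main(String[] args) {
        // 1.10.x-era call site; this overload is still present in 1.11.
        MessageType viaOriginalType = Types.buildMessage()
                .required(PrimitiveTypeName.BINARY)
                .as(OriginalType.UTF8).named("data")
                .named("schema");

        // 1.11.0 call site, the one in the stack trace above; the
        // LogicalTypeAnnotation overload does not exist in 1.10.x.
        MessageType viaLogicalType = Types.buildMessage()
                .required(PrimitiveTypeName.BINARY)
                .as(LogicalTypeAnnotation.stringType()).named("data")
                .named("schema");

        System.out.println(viaOriginalType);
        System.out.println(viaLogicalType);
    }
}

Run against a 1.10.x classpath, the second builder chain fails at the call
site, which is the kind of change the compatibility checks are meant to flag.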

On Thu, Nov 21, 2019 at 8:56 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Hi Michael,
>
> Unfortunately, I don't have much experience with Spark. But if Spark uses
> the parquet-mr library in an embedded way (that's how Hive uses it), Spark
> would need to be re-built with the 1.11 RC parquet-mr.
>
> Regards,
> Gabor
>
> On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <he...@gmail.com> wrote:
>
> > It appears a provided-scope dependency on spark-sql that leaks old parquet
> > versions was causing the runtime error below.  After including new
> > parquet-column and parquet-hadoop compile scope dependencies (in addition
> > to parquet-avro, which we already have at compile scope), our build
> > succeeds.
> >
> > https://github.com/bigdatagenomics/adam/pull/2232 <
> > https://github.com/bigdatagenomics/adam/pull/2232>
> >
> > However, when running via spark-submit I run into a similar runtime error
> >
> > Caused by: java.lang.NoSuchMethodError:
> >
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
> >         at
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
> >         at
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
> >         at
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
> >         at
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
> >         at
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
> >         at
> >
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
> >         at
> >
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
> >         at
> > org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> >         at
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> >         at
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> >         at
> >
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> >         at org.apache.spark.internal.io
> > .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> >         at org.apache.spark.internal.io
> >
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> >         at org.apache.spark.internal.io
> > .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> >         at org.apache.spark.internal.io
> > .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> >         at
> > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >         at org.apache.spark.scheduler.Task.run(Task.scala:123)
> >         at
> >
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> >         at
> > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> >         at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
> >
> >
> > Will bumping our library dependency version to 1.11 require a new version
> > of Spark, built against Parquet 1.11?
> >
> > Please accept my apologies if this is heading out-of-scope for the
> Parquet
> > mailing list.
> >
> >    michael
> >
> >
> > > On Nov 20, 2019, at 10:00 AM, Michael Heuer <he...@GMAIL.COM> wrote:
> > >
> > > I am willing to do some benchmarking on genomic data at scale but am
> not
> > quite sure what the Spark target version for 1.11.0 might be.  Will
> Parquet
> > 1.11.0 be compatible with Spark 2.4.x?
> > >
> > > Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> > >
> > > …
> > > D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
> > org/apache/parquet/schema/LogicalTypeAnnotation
> > >       at
> > org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> > >       at
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> > >       at
> >
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> > >       at
> >
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> > >       at org.apache.spark.internal.io
> > .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> > >       at org.apache.spark.internal.io
> >
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> > >       at org.apache.spark.internal.io
> > .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> > >       at org.apache.spark.internal.io
> > .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> > >       at
> > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> > >       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> > >       at
> >
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > >       at
> > org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> > >       at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> > >       at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >       at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >       at java.lang.Thread.run(Thread.java:748)
> > > Caused by: java.lang.ClassNotFoundException:
> > org.apache.parquet.schema.LogicalTypeAnnotation
> > >       at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> > >       at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > >       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> > >       at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > >
> > >   michael
> > >
> > >
> > >> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <ga...@apache.org>
> wrote:
> > >>
> > >> Thanks, Fokko.
> > >>
> > >> Ryan, we did not do such measurements yet. I'm afraid, I won't have
> > enough
> > >> time to do that in the next couple of weeks.
> > >>
> > >> Cheers,
> > >> Gabor
> > >>
> > >> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko
> <fokko@driesprong.frl
> > >
> > >> wrote:
> > >>
> > >>> Thanks Gabor for the explanation. I'd like to change my vote to +1
> > >>> (non-binding).
> > >>>
> > >>> Cheers, Fokko
> > >>>
> > >>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue
> > <rb...@netflix.com.invalid> wrote:
> > >>>
> > >>>> Gabor, what I meant was: have we tried this with real data to see
> the
> > >>>> effect? I think those results would be helpful.
> > >>>>
> > >>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <gabor@apache.org
> >
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Ryan,
> > >>>>>
> > >>>>> It is not easy to calculate. For the column indexes feature we
> > >>> introduced
> > >>>>> two new structures saved before the footer: column indexes and
> offset
> > >>>>> indexes. If the min/max values are not too long, then the
> truncation
> > >>>> might
> > >>>>> not decrease the file size because of the offset indexes. Moreover,
> > we
> > >>>> also
> > >>>>> introduced parquet.page.row.count.limit which might increase the
> > number
> > >>>> of
> > >>>>> pages which leads to increasing the file size.
> > >>>>> The footer itself is also changed and we are saving more values in
> > it:
> > >>>> the
> > >>>>> offset values to the column/offset indexes, the new logical type
> > >>>>> structures, the CRC checksums (we might have some others).
> > >>>>> So, the size of the files with small amount of data will be
> increased
> > >>>>> (because of the larger footer). The size of the files where the
> > values
> > >>>> can
> > >>>>> be encoded very well (RLE) will probably be increased (because we
> > will
> > >>>> have
> > >>>>> more pages). The size of some files where the values are long
> > (>64bytes
> > >>>> by
> > >>>>> default) might be decreased because of truncating the min/max
> values.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Gabor
> > >>>>>
> > >>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue
> <rblue@netflix.com.invalid
> > >
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Gabor, do we have an idea of the additional overhead for a
> non-test
> > >>>> data
> > >>>>>> file? It should be easy to validate that this doesn't introduce an
> > >>>>>> unreasonable amount of overhead. In some cases, it should actually
> > be
> > >>>>>> smaller since the column indexes are truncated and page stats are
> > >>> not.
> > >>>>>>
> > >>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > >>>>>> <ga...@cloudera.com.invalid> wrote:
> > >>>>>>
> > >>>>>>> Hi Fokko,
> > >>>>>>>
> > >>>>>>> For the first point. The referenced constructor is private and
> > >>>> Iceberg
> > >>>>>> uses
> > >>>>>>> it via reflection. It is not a breaking change. I think,
> parquet-mr
> > >>>>> shall
> > >>>>>>> not keep private methods only because of clients might use them
> via
> > >>>>>>> reflection.
> > >>>>>>>
> > >>>>>>> About the checksum. I've agreed on having the CRC checksum write
> > >>>>> enabled
> > >>>>>> by
> > >>>>>>> default because the benchmarks did not show significant
> performance
> > >>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for
> > >>>>>> details.
> > >>>>>>>
> > >>>>>>> About the file size change. 1.11.0 is introducing column indexes,
> > >>> CRC
> > >>>>>>> checksum, removing the statistics from the page headers and maybe
> > >>>> other
> > >>>>>>> changes that impact file size. If only file size is in question I
> > >>>>> cannot
> > >>>>>>> see a breaking change here.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Gabor
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> > >>>> <fokko@driesprong.frl
> > >>>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Unfortunately, a -1 from my side (non-binding)
> > >>>>>>>>
> > >>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
> > >>>>>>>>
> > >>>>>>>>  - We've broken backward compatibility of the constructor of
> > >>>>>>>>  ColumnChunkPageWriteStore
> > >>>>>>>>  <
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > >>>>>>>>> .
> > >>>>>>>>  This required a change
> > >>>>>>>>  <
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > >>>>>>>>>
> > >>>>>>>>  to the code. This isn't a hard blocker, but if there will be a
> > >>>> new
> > >>>>>> RC,
> > >>>>>>>> I've
> > >>>>>>>>  submitted a patch:
> > >>>> https://github.com/apache/parquet-mr/pull/699
> > >>>>>>>>  - Related, that we need to put in the changelog, is that
> > >>>> checksums
> > >>>>>> are
> > >>>>>>>>  enabled by default:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > >>>>>>>> This
> > >>>>>>>>  will impact performance. I would suggest disabling it by
> > >>>> default:
> > >>>>>>>>  https://github.com/apache/parquet-mr/pull/700
> > >>>>>>>>  <
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > >>>>>>>>>
> > >>>>>>>>  - Binary compatibility. While updating Iceberg, I've noticed
> > >>>> that
> > >>>>>> the
> > >>>>>>>>  split-test was failing:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > >>>>>>>> The
> > >>>>>>>>  two records are now divided over four Spark partitions.
> > >>>> Something
> > >>>>> in
> > >>>>>>> the
> > >>>>>>>>  output has changed since the files are bigger now. Has anyone
> > >>>> any
> > >>>>>> idea
> > >>>>>>>> to
> > >>>>>>>>  check what's changed, or a way to check this? The only thing I
> > >>>> can
> > >>>>>>>> think of
> > >>>>>>>>  is the checksum mentioned above.
> > >>>>>>>>
> > >>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> > >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > >>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > >>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > >>>>>>>>
> > >>>>>>>> $ parquet-tools cat
> > >>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > >>>>>>>> id = 1
> > >>>>>>>> data = a
> > >>>>>>>>
> > >>>>>>>> $ parquet-tools cat
> > >>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > >>>>>>>> id = 1
> > >>>>>>>> data = a
> > >>>>>>>>
> > >>>>>>>> A binary diff here:
> > >>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > >>>>>>>>
> > >>>>>>>> Cheers, Fokko
> > >>>>>>>>
> > >>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <
> > >>>>>>> chenjunjiedada@gmail.com
> > >>>>>>>>> > wrote:
> > >>>>>>>>
> > >>>>>>>>> +1
> > >>>>>>>>> Verified signature, checksum and ran mvn install successfully.
> > >>>>>>>>>
> > >>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> wrote on Thursday,
> > >>> Nov 14, 2019 at 2:05 PM:
> > >>>>>>>>>>
> > >>>>>>>>>> +1
> > >>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> > >>>>>>> "sql/test-only"
> > >>>>>>>>> -Phadoop-3.2
> > >>>>>>>>>>
> > >>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
> > >>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>   Hi everyone,
> > >>>>>>>>>>
> > >>>>>>>>>>   I propose the following RC to be released as official
> > >>>> Apache
> > >>>>>>>> Parquet
> > >>>>>>>>> 1.11.0
> > >>>>>>>>>>   release.
> > >>>>>>>>>>
> > >>>>>>>>>>   The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > >>>>>>>>>>   * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > >>>>>>>>>>   *
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> > >>>>>>>>>>
> > >>>>>>>>>>   The release tarball, signature, and checksums are here:
> > >>>>>>>>>>   *
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> > >>>>>>>>>>
> > >>>>>>>>>>   You can find the KEYS file here:
> > >>>>>>>>>>   *
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://apache.org/dist/parquet/KEYS
> > >>>>>>>>>>
> > >>>>>>>>>>   Binary artifacts are staged in Nexus here:
> > >>>>>>>>>>   *
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://repository.apache.org/content/groups/staging/org/apache/parquet/
> > >>>>>>>>>>
> > >>>>>>>>>>   This release includes the changes listed at:
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> > >>>>>>>>>>
> > >>>>>>>>>>   Please download, verify, and test.
> > >>>>>>>>>>
> > >>>>>>>>>>   Please vote in the next 72 hours.
> > >>>>>>>>>>
> > >>>>>>>>>>   [ ] +1 Release this as Apache Parquet 1.11.0
> > >>>>>>>>>>   [ ] +0
> > >>>>>>>>>>   [ ] -1 Do not release this because...
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Ryan Blue
> > >>>>>> Software Engineer
> > >>>>>> Netflix
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Ryan Blue
> > >>>> Software Engineer
> > >>>> Netflix
> > >>>>
> > >>>
> > >
> >
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Michael,

Unfortunately, I don't have much experience with Spark. But if Spark uses
the parquet-mr library in an embedded way (that's how Hive uses it), Spark
would need to be re-built with the 1.11 RC parquet-mr.

Regards,
Gabor

On Wed, Nov 20, 2019 at 5:44 PM Michael Heuer <he...@gmail.com> wrote:

> It appears a provided-scope dependency on spark-sql that leaks old parquet
> versions was causing the runtime error below.  After including new
> parquet-column and parquet-hadoop compile scope dependencies (in addition
> to parquet-avro, which we already have at compile scope), our build
> succeeds.
>
> https://github.com/bigdatagenomics/adam/pull/2232 <
> https://github.com/bigdatagenomics/adam/pull/2232>
>
> However, when running via spark-submit I run into a similar runtime error
>
> Caused by: java.lang.NoSuchMethodError:
> org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
>         at
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
>         at
> org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
>         at
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
>         at
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
>         at
> org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
>         at
> org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
>         at
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
>         at
> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
>         at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
>         at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
>         at
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
>         at org.apache.spark.internal.io
> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
>         at org.apache.spark.internal.io
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
>         at org.apache.spark.internal.io
> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
>         at org.apache.spark.internal.io
> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
>         at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:123)
>         at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>         at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
>
> Will bumping our library dependency version to 1.11 require a new version
> of Spark, built against Parquet 1.11?
>
> Please accept my apologies if this is heading out-of-scope for the Parquet
> mailing list.
>
>    michael
>
>
> > On Nov 20, 2019, at 10:00 AM, Michael Heuer <he...@GMAIL.COM> wrote:
> >
> > I am willing to do some benchmarking on genomic data at scale but am not
> quite sure what the Spark target version for 1.11.0 might be.  Will Parquet
> 1.11.0 be compatible with Spark 2.4.x?
> >
> > Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> >
> > …
> > D 0, localhost, executor driver): java.lang.NoClassDefFoundError:
> org/apache/parquet/schema/LogicalTypeAnnotation
> >       at
> org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> >       at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> >       at
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> >       at
> org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> >       at org.apache.spark.internal.io
> .HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> >       at org.apache.spark.internal.io
> .SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> >       at org.apache.spark.internal.io
> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> >       at org.apache.spark.internal.io
> .SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> >       at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> >       at org.apache.spark.scheduler.Task.run(Task.scala:123)
> >       at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> >       at
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> >       at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> >       at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >       at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >       at java.lang.Thread.run(Thread.java:748)
> > Caused by: java.lang.ClassNotFoundException:
> org.apache.parquet.schema.LogicalTypeAnnotation
> >       at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> >       at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> >       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> >       at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> >
> >   michael
> >
> >
> >> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <ga...@apache.org> wrote:
> >>
> >> Thanks, Fokko.
> >>
> >> Ryan, we did not do such measurements yet. I'm afraid, I won't have
> enough
> >> time to do that in the next couple of weeks.
> >>
> >> Cheers,
> >> Gabor
> >>
> >> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <fokko@driesprong.frl
> >
> >> wrote:
> >>
> >>> Thanks Gabor for the explanation. I'd like to change my vote to +1
> >>> (non-binding).
> >>>
> >>> Cheers, Fokko
> >>>
> >>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue
> <rb...@netflix.com.invalid> wrote:
> >>>
> >>>> Gabor, what I meant was: have we tried this with real data to see the
> >>>> effect? I think those results would be helpful.
> >>>>
> >>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <ga...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> It is not easy to calculate. For the column indexes feature we
> >>> introduced
> >>>>> two new structures saved before the footer: column indexes and offset
> >>>>> indexes. If the min/max values are not too long, then the truncation
> >>>> might
> >>>>> not decrease the file size because of the offset indexes. Moreover,
> we
> >>>> also
> >>>>> introduced parquet.page.row.count.limit which might increase the
> number
> >>>> of
> >>>>> pages which leads to increasing the file size.
> >>>>> The footer itself is also changed and we are saving more values in
> it:
> >>>> the
> >>>>> offset values to the column/offset indexes, the new logical type
> >>>>> structures, the CRC checksums (we might have some others).
> >>>>> So, the size of the files with small amount of data will be increased
> >>>>> (because of the larger footer). The size of the files where the
> values
> >>>> can
> >>>>> be encoded very well (RLE) will probably be increased (because we
> will
> >>>> have
> >>>>> more pages). The size of some files where the values are long
> (>64bytes
> >>>> by
> >>>>> default) might be decreased because of truncating the min/max values.
> >>>>>
> >>>>> Regards,
> >>>>> Gabor
> >>>>>
> >>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rblue@netflix.com.invalid
> >
> >>>>> wrote:
> >>>>>
> >>>>>> Gabor, do we have an idea of the additional overhead for a non-test
> >>>> data
> >>>>>> file? It should be easy to validate that this doesn't introduce an
> >>>>>> unreasonable amount of overhead. In some cases, it should actually
> be
> >>>>>> smaller since the column indexes are truncated and page stats are
> >>> not.
> >>>>>>
> >>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> >>>>>> <ga...@cloudera.com.invalid> wrote:
> >>>>>>
> >>>>>>> Hi Fokko,
> >>>>>>>
> >>>>>>> For the first point. The referenced constructor is private and
> >>>> Iceberg
> >>>>>> uses
> >>>>>>> it via reflection. It is not a breaking change. I think, parquet-mr
> >>>>> shall
> >>>>>>> not keep private methods only because of clients might use them via
> >>>>>>> reflection.
> >>>>>>>
> >>>>>>> About the checksum. I've agreed on having the CRC checksum write
> >>>>> enabled
> >>>>>> by
> >>>>>>> default because the benchmarks did not show significant performance
> >>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for
> >>>>>> details.
> >>>>>>>
> >>>>>>> About the file size change. 1.11.0 is introducing column indexes,
> >>> CRC
> >>>>>>> checksum, removing the statistics from the page headers and maybe
> >>>> other
> >>>>>>> changes that impact file size. If only file size is in question I
> >>>>> cannot
> >>>>>>> see a breaking change here.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gabor
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> >>>> <fokko@driesprong.frl
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Unfortunately, a -1 from my side (non-binding)
> >>>>>>>>
> >>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
> >>>>>>>>
> >>>>>>>>  - We've broken backward compatibility of the constructor of
> >>>>>>>>  ColumnChunkPageWriteStore
> >>>>>>>>  <
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> >>>>>>>>> .
> >>>>>>>>  This required a change
> >>>>>>>>  <
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> >>>>>>>>>
> >>>>>>>>  to the code. This isn't a hard blocker, but if there will be a
> >>>> new
> >>>>>> RC,
> >>>>>>>> I've
> >>>>>>>>  submitted a patch:
> >>>> https://github.com/apache/parquet-mr/pull/699
> >>>>>>>>  - Related, that we need to put in the changelog, is that
> >>>> checksums
> >>>>>> are
> >>>>>>>>  enabled by default:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> >>>>>>>> This
> >>>>>>>>  will impact performance. I would suggest disabling it by
> >>>> default:
> >>>>>>>>  https://github.com/apache/parquet-mr/pull/700
> >>>>>>>>  <
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> >>>>>>>>>
> >>>>>>>>  - Binary compatibility. While updating Iceberg, I've noticed
> >>>> that
> >>>>>> the
> >>>>>>>>  split-test was failing:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> >>>>>>>> The
> >>>>>>>>  two records are now divided over four Spark partitions.
> >>>> Something
> >>>>> in
> >>>>>>> the
> >>>>>>>>  output has changed since the files are bigger now. Has anyone
> >>>> any
> >>>>>> idea
> >>>>>>>> to
> >>>>>>>>  check what's changed, or a way to check this? The only thing I
> >>>> can
> >>>>>>>> think of
> >>>>>>>>  is the checksum mentioned above.
> >>>>>>>>
> >>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
> >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> >>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> >>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> >>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> >>>>>>>>
> >>>>>>>> $ parquet-tools cat
> >>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> >>>>>>>> id = 1
> >>>>>>>> data = a
> >>>>>>>>
> >>>>>>>> $ parquet-tools cat
> >>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> >>>>>>>> id = 1
> >>>>>>>> data = a
> >>>>>>>>
> >>>>>>>> A binary diff here:
> >>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> >>>>>>>>
> >>>>>>>> Cheers, Fokko
> >>>>>>>>
> >>>>>>>> On Sat, Nov 16, 2019 at 04:18, Junjie Chen <
> >>>>>>> chenjunjiedada@gmail.com
> >>>>>>>>> > wrote:
> >>>>>>>>
> >>>>>>>>> +1
> >>>>>>>>> Verified signature, checksum and ran mvn install successfully.
> >>>>>>>>>
> >>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> wrote on Thursday,
> >>> Nov 14, 2019 at 2:05 PM:
> >>>>>>>>>>
> >>>>>>>>>> +1
> >>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> >>>>>>> "sql/test-only"
> >>>>>>>>> -Phadoop-3.2
> >>>>>>>>>>
> >>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
> >>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>   Hi everyone,
> >>>>>>>>>>
> >>>>>>>>>>   I propose the following RC to be released as official
> >>>> Apache
> >>>>>>>> Parquet
> >>>>>>>>> 1.11.0
> >>>>>>>>>>   release.
> >>>>>>>>>>
> >>>>>>>>>>   The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> >>>>>>>>>>   * This corresponds to the tag: apache-parquet-1.11.0-rc7
> >>>>>>>>>>   *
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> >>>>>>>>>>
> >>>>>>>>>>   The release tarball, signature, and checksums are here:
> >>>>>>>>>>   *
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> >>>>>>>>>>
> >>>>>>>>>>   You can find the KEYS file here:
> >>>>>>>>>>   *
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://apache.org/dist/parquet/KEYS
> >>>>>>>>>>
> >>>>>>>>>>   Binary artifacts are staged in Nexus here:
> >>>>>>>>>>   *
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >>>>>>>>>>
> >>>>>>>>>>   This release includes the changes listed at:
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> >>>>>>>>>>
> >>>>>>>>>>   Please download, verify, and test.
> >>>>>>>>>>
> >>>>>>>>>>   Please vote in the next 72 hours.
> >>>>>>>>>>
> >>>>>>>>>>   [ ] +1 Release this as Apache Parquet 1.11.0
> >>>>>>>>>>   [ ] +0
> >>>>>>>>>>   [ ] -1 Do not release this because...
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Ryan Blue
> >>>>>> Software Engineer
> >>>>>> Netflix
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
> >>>>
> >>>
> >
>
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Michael Heuer <he...@gmail.com>.
It appears a provided-scope dependency on spark-sql that leaks old parquet versions was causing the runtime error below.  After including new parquet-column and parquet-hadoop compile-scope dependencies (in addition to parquet-avro, which we already have at compile scope), our build succeeds.

https://github.com/bigdatagenomics/adam/pull/2232 <https://github.com/bigdatagenomics/adam/pull/2232>
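
For anyone hitting the same thing, the dependency change amounts to something
like the following pom sketch (only illustrative; the version shown is the
1.11.0 RC under vote):

<!-- Declare parquet-column and parquet-hadoop at compile scope so the
     provided-scope spark-sql dependency cannot leak 1.10.x classes. -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-column</artifactId>
  <version>1.11.0</version>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <version>1.11.0</version>
</dependency>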

However, when running via spark-submit I run into a similar runtime error

Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$PrimitiveBuilder.as(Lorg/apache/parquet/schema/LogicalTypeAnnotation;)Lorg/apache/parquet/schema/Types$Builder;
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:161)
	at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:226)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:182)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:141)
	at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:244)
	at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:135)
	at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:126)
	at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
	at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
	at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
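
As a quick diagnostic (a standalone sketch, not part of our build), printing
where a Parquet class is loaded from, and probing for the 1.11-only
LogicalTypeAnnotation class, shows whether spark-submit is still putting
1.10.x jars ahead of ours on the classpath:

import java.security.CodeSource;

import org.apache.parquet.hadoop.ParquetOutputFormat;

public class WhichParquet {
    public static void main(String[] args) {
        // Which jar did parquet-hadoop actually come from? Under spark-submit
        // this typically points into Spark's jars/ directory, not the app jar.
        CodeSource src = ParquetOutputFormat.class
                .getProtectionDomain().getCodeSource();
        System.out.println(src == null ? "unknown source" : src.getLocation());
        try {
            // org.apache.parquet.schema.LogicalTypeAnnotation is new in 1.11.
            Class.forName("org.apache.parquet.schema.LogicalTypeAnnotation");
            System.out.println("parquet-column >= 1.11 on the classpath");
        } catch (ClassNotFoundException e) {
            System.out.println("parquet-column < 1.11 on the classpath");
        }
    }
}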


Will bumping our library dependency version to 1.11 require a new version of Spark, built against Parquet 1.11?

Please accept my apologies if this is heading out-of-scope for the Parquet mailing list.

   michael


> On Nov 20, 2019, at 10:00 AM, Michael Heuer <he...@GMAIL.COM> wrote:
> 
> I am willing to do some benchmarking on genomic data at scale but am not quite sure what the Spark target version for 1.11.0 might be.  Will Parquet 1.11.0 be compatible with Spark 2.4.x?
> 
> Updating from 1.10.1 to 1.11.0 breaks at runtime in our build
> 
> …
> D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
> 	at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
> 	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
> 	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
> 	at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
> 	at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
> 	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
> 	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
> 	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:123)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> 
>   michael
> 
> 
>> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <ga...@apache.org> wrote:
>> 
>> Thanks, Fokko.
>> 
>> Ryan, we did not do such measurements yet. I'm afraid, I won't have enough
>> time to do that in the next couple of weeks.
>> 
>> Cheers,
>> Gabor
>> 
>> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <fo...@driesprong.frl>
>> wrote:
>> 
>>> Thanks Gabor for the explanation. I'd like to change my vote to +1
>>> (non-binding).
>>> 
>>> Cheers, Fokko
>>> 
>>> On Tue, Nov 19, 2019 at 18:03, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>> 
>>>> Gabor, what I meant was: have we tried this with real data to see the
>>>> effect? I think those results would be helpful.
>>>> 
>>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <ga...@apache.org>
>>>> wrote:
>>>> 
>>>>> Hi Ryan,
>>>>> 
>>>>> It is not easy to calculate. For the column indexes feature we
>>> introduced
>>>>> two new structures saved before the footer: column indexes and offset
>>>>> indexes. If the min/max values are not too long, then the truncation
>>>> might
>>>>> not decrease the file size because of the offset indexes. Moreover, we
>>>> also
>>>>> introduced parquet.page.row.count.limit which might increase the number
>>>> of
>>>>> pages which leads to increasing the file size.
>>>>> The footer itself is also changed and we are saving more values in it:
>>>> the
>>>>> offset values to the column/offset indexes, the new logical type
>>>>> structures, the CRC checksums (we might have some others).
>>>>> So, the size of the files with small amount of data will be increased
>>>>> (because of the larger footer). The size of the files where the values
>>>> can
>>>>> be encoded very well (RLE) will probably be increased (because we will
>>>> have
>>>>> more pages). The size of some files where the values are long (>64bytes
>>>> by
>>>>> default) might be decreased because of truncating the min/max values.
>>>>> 
>>>>> Regards,
>>>>> Gabor
>>>>> 
>>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>> 
>>>>>> Gabor, do we have an idea of the additional overhead for a non-test
>>>> data
>>>>>> file? It should be easy to validate that this doesn't introduce an
>>>>>> unreasonable amount of overhead. In some cases, it should actually be
>>>>>> smaller since the column indexes are truncated and page stats are
>>> not.
>>>>>> 
>>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>>>>>> <ga...@cloudera.com.invalid> wrote:
>>>>>> 
>>>>>>> Hi Fokko,
>>>>>>> 
>>>>>>> For the first point. The referenced constructor is private and
>>>> Iceberg
>>>>>> uses
>>>>>>> it via reflection. It is not a breaking change. I think, parquet-mr
>>>>> shall
>>>>>>> not keep private methods only because of clients might use them via
>>>>>>> reflection.
>>>>>>> 
>>>>>>> About the checksum. I've agreed on having the CRC checksum write
>>>>> enabled
>>>>>> by
>>>>>>> default because the benchmarks did not show significant performance
>>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for
>>>>>> details.
>>>>>>> 
>>>>>>> About the file size change. 1.11.0 is introducing column indexes,
>>> CRC
>>>>>>> checksum, removing the statistics from the page headers and maybe
>>>> other
>>>>>>> changes that impact file size. If only file size is in question I
>>>>> cannot
>>>>>>> see a breaking change here.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Gabor
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
>>>> <fokko@driesprong.frl
>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Unfortunately, a -1 from my side (non-binding)
>>>>>>>> 
>>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
>>>>>>>> 
>>>>>>>>  - We've broken backward compatibility of the constructor of
>>>>>>>>  ColumnChunkPageWriteStore
>>>>>>>>  <
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
>>>>>>>>> .
>>>>>>>>  This required a change
>>>>>>>>  <
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
>>>>>>>>> 
>>>>>>>>  to the code. This isn't a hard blocker, but if there will be a
>>>> new
>>>>>> RC,
>>>>>>>> I've
>>>>>>>>  submitted a patch:
>>>> https://github.com/apache/parquet-mr/pull/699
>>>>>>>>  - Related, that we need to put in the changelog, is that
>>>> checksums
>>>>>> are
>>>>>>>>  enabled by default:
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
>>>>>>>> This
>>>>>>>>  will impact performance. I would suggest disabling it by
>>>> default:
>>>>>>>>  https://github.com/apache/parquet-mr/pull/700
>>>>>>>>  <
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
>>>>>>>>> 
>>>>>>>>  - Binary compatibility. While updating Iceberg, I've noticed
>>>> that
>>>>>> the
>>>>>>>>  split-test was failing:
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>>>>>>>> The
>>>>>>>>  two records are now divided over four Spark partitions.
>>>> Something
>>>>> in
>>>>>>> the
>>>>>>>>  output has changed since the files are bigger now. Has anyone
>>>> any
>>>>>> idea
>>>>>>>> to
>>>>>>>>  check what's changed, or a way to check this? The only thing I
>>>> can
>>>>>>>> think of
>>>>>>>>  is the checksum mentioned above.
>>>>>>>> 
>>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
>>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>> 
>>>>>>>> $ parquet-tools cat
>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>>> id = 1
>>>>>>>> data = a
>>>>>>>> 
>>>>>>>> $ parquet-tools cat
>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>>> id = 1
>>>>>>>> data = a
>>>>>>>> 
>>>>>>>> A binary diff here:
>>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>>>>>>>> 
>>>>>>>> Cheers, Fokko
>>>>>>>> 
>>>>>>>> Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
>>>>>>> chenjunjiedada@gmail.com
>>>>>>>>> :
>>>>>>>> 
>>>>>>>>> +1
>>>>>>>>> Verified signature, checksum and ran mvn install successfully.
>>>>>>>>> 
>>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> 于2019年11月14日周四
>>> 下午2:05写道:
>>>>>>>>>> 
>>>>>>>>>> +1
>>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt
>>>>>>> "sql/test-only"
>>>>>>>>> -Phadoop-3.2
>>>>>>>>>> 
>>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>   Hi everyone,
>>>>>>>>>> 
>>>>>>>>>>   I propose the following RC to be released as official
>>>> Apache
>>>>>>>> Parquet
>>>>>>>>> 1.11.0
>>>>>>>>>>   release.
>>>>>>>>>> 
>>>>>>>>>>   The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>>   * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>>   *
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&amp;reserved=0
>>>>>>>>>> 
>>>>>>>>>>   The release tarball, signature, and checksums are here:
>>>>>>>>>>   *
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&amp;reserved=0
>>>>>>>>>> 
>>>>>>>>>>   You can find the KEYS file here:
>>>>>>>>>>   *
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&amp;reserved=0
>>>>>>>>>> 
>>>>>>>>>>   Binary artifacts are staged in Nexus here:
>>>>>>>>>>   *
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
>>>>>>>>>> 
>>>>>>>>>>   This release includes the changes listed at:
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
>>>>>>>>>> 
>>>>>>>>>>   Please download, verify, and test.
>>>>>>>>>> 
>>>>>>>>>>   Please vote in the next 72 hours.
>>>>>>>>>> 
>>>>>>>>>>   [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>>   [ ] +0
>>>>>>>>>>   [ ] -1 Do not release this because...
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>> 
>>> 
> 


Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Michael Heuer <he...@gmail.com>.
I am willing to do some benchmarking on genomic data at scale, but I am not quite sure what the target Spark version for 1.11.0 might be. Will Parquet 1.11.0 be compatible with Spark 2.4.x?

Updating from 1.10.1 to 1.11.0 breaks at runtime in our build:

…
D 0, localhost, executor driver): java.lang.NoClassDefFoundError: org/apache/parquet/schema/LogicalTypeAnnotation
	at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:121)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
	at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35)
	at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.initWriter(SparkHadoopWriter.scala:350)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:120)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.parquet.schema.LogicalTypeAnnotation
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
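
As a sanity check on the classpath (a hypothetical diagnostic, not part of
our build), something like the following should print which jar provides
parquet-column at runtime; if it points at a bundled parquet-column 1.10.x
jar from the Spark distribution, the parquet-avro 1.11 initialization above
would fail exactly like this:

public class WhichParquet {
  public static void main(String[] args) {
    // MessageType exists in both 1.10 and 1.11, so this class loads under
    // either version and prints the location of the jar that provided it.
    System.out.println(org.apache.parquet.schema.MessageType.class
        .getProtectionDomain().getCodeSource().getLocation());
  }
}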

   michael


> On Nov 20, 2019, at 3:29 AM, Gabor Szadovszky <ga...@apache.org> wrote:
> 
> Thanks, Fokko.
> 
> Ryan, we did not do such measurements yet. I'm afraid, I won't have enough
> time to do that in the next couple of weeks.
> 
> Cheers,
> Gabor
> 
> On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
> 
>> Thanks Gabor for the explanation. I'd like to change my vote to +1
>> (non-binding).
>> 
>> Cheers, Fokko
>> 
>> Op di 19 nov. 2019 om 18:03 schreef Ryan Blue <rb...@netflix.com.invalid>
>> 
>>> Gabor, what I meant was: have we tried this with real data to see the
>>> effect? I think those results would be helpful.
>>> 
>>> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <ga...@apache.org>
>>> wrote:
>>> 
>>>> Hi Ryan,
>>>> 
>>>> It is not easy to calculate. For the column indexes feature we
>> introduced
>>>> two new structures saved before the footer: column indexes and offset
>>>> indexes. If the min/max values are not too long, then the truncation
>>> might
>>>> not decrease the file size because of the offset indexes. Moreover, we
>>> also
>>>> introduced parquet.page.row.count.limit which might increase the number
>>> of
>>>> pages which leads to increasing the file size.
>>>> The footer itself is also changed and we are saving more values in it:
>>> the
>>>> offset values to the column/offset indexes, the new logical type
>>>> structures, the CRC checksums (we might have some others).
>>>> So, the size of the files with small amount of data will be increased
>>>> (because of the larger footer). The size of the files where the values
>>> can
>>>> be encoded very well (RLE) will probably be increased (because we will
>>> have
>>>> more pages). The size of some files where the values are long (>64bytes
>>> by
>>>> default) might be decreased because of truncating the min/max values.
>>>> 
>>>> Regards,
>>>> Gabor
>>>> 
>>>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>> 
>>>>> Gabor, do we have an idea of the additional overhead for a non-test
>>> data
>>>>> file? It should be easy to validate that this doesn't introduce an
>>>>> unreasonable amount of overhead. In some cases, it should actually be
>>>>> smaller since the column indexes are truncated and page stats are
>> not.
>>>>> 
>>>>> On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>>>>> <ga...@cloudera.com.invalid> wrote:
>>>>> 
>>>>>> Hi Fokko,
>>>>>> 
>>>>>> For the first point. The referenced constructor is private and
>>> Iceberg
>>>>> uses
>>>>>> it via reflection. It is not a breaking change. I think, parquet-mr
>>>> shall
>>>>>> not keep private methods only because of clients might use them via
>>>>>> reflection.
>>>>>> 
>>>>>> About the checksum. I've agreed on having the CRC checksum write
>>>> enabled
>>>>> by
>>>>>> default because the benchmarks did not show significant performance
>>>>>> penalties. See https://github.com/apache/parquet-mr/pull/647 for
>>>>> details.
>>>>>> 
>>>>>> About the file size change. 1.11.0 is introducing column indexes,
>> CRC
>>>>>> checksum, removing the statistics from the page headers and maybe
>>> other
>>>>>> changes that impact file size. If only file size is in question I
>>>> cannot
>>>>>> see a breaking change here.
>>>>>> 
>>>>>> Regards,
>>>>>> Gabor
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
>>> <fokko@driesprong.frl
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Unfortunately, a -1 from my side (non-binding)
>>>>>>> 
>>>>>>> I've updated Iceberg to Parquet 1.11.0, and found three things:
>>>>>>> 
>>>>>>>   - We've broken backward compatibility of the constructor of
>>>>>>>   ColumnChunkPageWriteStore
>>>>>>>   <
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
>>>>>>>> .
>>>>>>>   This required a change
>>>>>>>   <
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
>>>>>>>> 
>>>>>>>   to the code. This isn't a hard blocker, but if there will be a
>>> new
>>>>> RC,
>>>>>>> I've
>>>>>>>   submitted a patch:
>>> https://github.com/apache/parquet-mr/pull/699
>>>>>>>   - Related, that we need to put in the changelog, is that
>>> checksums
>>>>> are
>>>>>>>   enabled by default:
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
>>>>>>> This
>>>>>>>   will impact performance. I would suggest disabling it by
>>> default:
>>>>>>>   https://github.com/apache/parquet-mr/pull/700
>>>>>>>   <
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
>>>>>>>> 
>>>>>>>   - Binary compatibility. While updating Iceberg, I've noticed
>>> that
>>>>> the
>>>>>>>   split-test was failing:
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
>>>>>>> The
>>>>>>>   two records are now divided over four Spark partitions.
>>> Something
>>>> in
>>>>>> the
>>>>>>>   output has changed since the files are bigger now. Has anyone
>>> any
>>>>> idea
>>>>>>> to
>>>>>>>   check what's changed, or a way to check this? The only thing I
>>> can
>>>>>>> think of
>>>>>>>   is the checksum mentioned above.
>>>>>>> 
>>>>>>> $ ls -lah ~/Desktop/parquet-1-1*
>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>> -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
>>>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>> 
>>>>>>> $ parquet-tools cat
>>>>> /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
>>>>>>> id = 1
>>>>>>> data = a
>>>>>>> 
>>>>>>> $ parquet-tools cat
>>>>> /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
>>>>>>> id = 1
>>>>>>> data = a
>>>>>>> 
>>>>>>> A binary diff here:
>>>>>>> https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
>>>>>>> 
>>>>>>> Cheers, Fokko
>>>>>>> 
>>>>>>> Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
>>>>>> chenjunjiedada@gmail.com
>>>>>>>> :
>>>>>>> 
>>>>>>>> +1
>>>>>>>> Verified signature, checksum and ran mvn install successfully.
>>>>>>>> 
>>>>>>>> Wang, Yuming <yu...@ebay.com.invalid> 于2019年11月14日周四
>> 下午2:05写道:
>>>>>>>>> 
>>>>>>>>> +1
>>>>>>>>> Tested Parquet 1.11.0 with Spark SQL module: build/sbt
>>>>>> "sql/test-only"
>>>>>>>> -Phadoop-3.2
>>>>>>>>> 
>>>>>>>>> On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>    Hi everyone,
>>>>>>>>> 
>>>>>>>>>    I propose the following RC to be released as official
>>> Apache
>>>>>>> Parquet
>>>>>>>> 1.11.0
>>>>>>>>>    release.
>>>>>>>>> 
>>>>>>>>>    The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>>>>>>>>>    * This corresponds to the tag: apache-parquet-1.11.0-rc7
>>>>>>>>>    *
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&amp;reserved=0
>>>>>>>>> 
>>>>>>>>>    The release tarball, signature, and checksums are here:
>>>>>>>>>    *
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&amp;reserved=0
>>>>>>>>> 
>>>>>>>>>    You can find the KEYS file here:
>>>>>>>>>    *
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&amp;reserved=0
>>>>>>>>> 
>>>>>>>>>    Binary artifacts are staged in Nexus here:
>>>>>>>>>    *
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
>>>>>>>>> 
>>>>>>>>>    This release includes the changes listed at:
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
>>>>>>>>> 
>>>>>>>>>    Please download, verify, and test.
>>>>>>>>> 
>>>>>>>>>    Please vote in the next 72 hours.
>>>>>>>>> 
>>>>>>>>>    [ ] +1 Release this as Apache Parquet 1.11.0
>>>>>>>>>    [ ] +0
>>>>>>>>>    [ ] -1 Do not release this because...
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 


Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Thanks, Fokko.

Ryan, we have not done such measurements yet. I'm afraid I won't have enough
time to do that in the next couple of weeks.
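
If anyone wants to pick this up in the meantime, a minimal sketch of such a
measurement is below. The tiny Avro schema and the row count are arbitrary;
the idea is to run it once with 1.10.1 and once with 1.11.0 on the classpath
and compare the resulting file sizes.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class FileSizeProbe {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"r\", \"fields\": ["
        + "{\"name\": \"id\", \"type\": \"long\"},"
        + "{\"name\": \"data\", \"type\": \"string\"}]}");
    // With 1.11.0, the new page CRCs could additionally be toggled with
    // withPageWriteChecksumEnabled(false) on the builder to isolate their
    // cost (that method does not exist in 1.10.1).
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("size-probe.parquet"))
                 .withSchema(schema)
                 .build()) {
      for (long i = 0; i < 1_000_000L; i++) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", i);
        record.put("data", "a");
        writer.write(record);
      }
    }
    System.out.println(new File("size-probe.parquet").length() + " bytes");
  }
}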

Cheers,
Gabor

On Tue, Nov 19, 2019 at 6:14 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Thanks Gabor for the explanation. I'd like to change my vote to +1
> (non-binding).
>
> Cheers, Fokko
>
> Op di 19 nov. 2019 om 18:03 schreef Ryan Blue <rb...@netflix.com.invalid>
>
> > Gabor, what I meant was: have we tried this with real data to see the
> > effect? I think those results would be helpful.
> >
> > On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <ga...@apache.org>
> > wrote:
> >
> > > Hi Ryan,
> > >
> > > It is not easy to calculate. For the column indexes feature we
> introduced
> > > two new structures saved before the footer: column indexes and offset
> > > indexes. If the min/max values are not too long, then the truncation
> > might
> > > not decrease the file size because of the offset indexes. Moreover, we
> > also
> > > introduced parquet.page.row.count.limit which might increase the number
> > of
> > > pages which leads to increasing the file size.
> > > The footer itself is also changed and we are saving more values in it:
> > the
> > > offset values to the column/offset indexes, the new logical type
> > > structures, the CRC checksums (we might have some others).
> > > So, the size of the files with small amount of data will be increased
> > > (because of the larger footer). The size of the files where the values
> > can
> > > be encoded very well (RLE) will probably be increased (because we will
> > have
> > > more pages). The size of some files where the values are long (>64bytes
> > by
> > > default) might be decreased because of truncating the min/max values.
> > >
> > > Regards,
> > > Gabor
> > >
> > > On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid>
> > > wrote:
> > >
> > > > Gabor, do we have an idea of the additional overhead for a non-test
> > data
> > > > file? It should be easy to validate that this doesn't introduce an
> > > > unreasonable amount of overhead. In some cases, it should actually be
> > > > smaller since the column indexes are truncated and page stats are
> not.
> > > >
> > > > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > > > <ga...@cloudera.com.invalid> wrote:
> > > >
> > > > > Hi Fokko,
> > > > >
> > > > > For the first point. The referenced constructor is private and
> > Iceberg
> > > > uses
> > > > > it via reflection. It is not a breaking change. I think, parquet-mr
> > > shall
> > > > > not keep private methods only because of clients might use them via
> > > > > reflection.
> > > > >
> > > > > About the checksum. I've agreed on having the CRC checksum write
> > > enabled
> > > > by
> > > > > default because the benchmarks did not show significant performance
> > > > > penalties. See https://github.com/apache/parquet-mr/pull/647 for
> > > > details.
> > > > >
> > > > > About the file size change. 1.11.0 is introducing column indexes,
> CRC
> > > > > checksum, removing the statistics from the page headers and maybe
> > other
> > > > > changes that impact file size. If only file size is in question I
> > > cannot
> > > > > see a breaking change here.
> > > > >
> > > > > Regards,
> > > > > Gabor
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> > <fokko@driesprong.frl
> > > >
> > > > > wrote:
> > > > >
> > > > > > Unfortunately, a -1 from my side (non-binding)
> > > > > >
> > > > > > I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > > > >
> > > > > >    - We've broken backward compatibility of the constructor of
> > > > > >    ColumnChunkPageWriteStore
> > > > > >    <
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > > > > > >.
> > > > > >    This required a change
> > > > > >    <
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > > > > > >
> > > > > >    to the code. This isn't a hard blocker, but if there will be a
> > new
> > > > RC,
> > > > > > I've
> > > > > >    submitted a patch:
> > https://github.com/apache/parquet-mr/pull/699
> > > > > >    - Related, that we need to put in the changelog, is that
> > checksums
> > > > are
> > > > > >    enabled by default:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > > > This
> > > > > >    will impact performance. I would suggest disabling it by
> > default:
> > > > > >    https://github.com/apache/parquet-mr/pull/700
> > > > > >    <
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > > > > > >
> > > > > >    - Binary compatibility. While updating Iceberg, I've noticed
> > that
> > > > the
> > > > > >    split-test was failing:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > > The
> > > > > >    two records are now divided over four Spark partitions.
> > Something
> > > in
> > > > > the
> > > > > >    output has changed since the files are bigger now. Has anyone
> > any
> > > > idea
> > > > > > to
> > > > > >    check what's changed, or a way to check this? The only thing I
> > can
> > > > > > think of
> > > > > >    is the checksum mentioned above.
> > > > > >
> > > > > > $ ls -lah ~/Desktop/parquet-1-1*
> > > > > > -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > > > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > > > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > >
> > > > > > $ parquet-tools cat
> > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > > id = 1
> > > > > > data = a
> > > > > >
> > > > > > $ parquet-tools cat
> > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > > id = 1
> > > > > > data = a
> > > > > >
> > > > > > A binary diff here:
> > > > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > > >
> > > > > > Cheers, Fokko
> > > > > >
> > > > > > Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
> > > > > chenjunjiedada@gmail.com
> > > > > > >:
> > > > > >
> > > > > > > +1
> > > > > > > Verified signature, checksum and ran mvn install successfully.
> > > > > > >
> > > > > > > Wang, Yuming <yu...@ebay.com.invalid> 于2019年11月14日周四
> 下午2:05写道:
> > > > > > > >
> > > > > > > > +1
> > > > > > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> > > > > "sql/test-only"
> > > > > > > -Phadoop-3.2
> > > > > > > >
> > > > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
> > > > wrote:
> > > > > > > >
> > > > > > > >     Hi everyone,
> > > > > > > >
> > > > > > > >     I propose the following RC to be released as official
> > Apache
> > > > > > Parquet
> > > > > > > 1.11.0
> > > > > > > >     release.
> > > > > > > >
> > > > > > > >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > > >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > > > > >     *
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&amp;reserved=0
> > > > > > > >
> > > > > > > >     The release tarball, signature, and checksums are here:
> > > > > > > >     *
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&amp;reserved=0
> > > > > > > >
> > > > > > > >     You can find the KEYS file here:
> > > > > > > >     *
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&amp;reserved=0
> > > > > > > >
> > > > > > > >     Binary artifacts are staged in Nexus here:
> > > > > > > >     *
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
> > > > > > > >
> > > > > > > >     This release includes the changes listed at:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
> > > > > > > >
> > > > > > > >     Please download, verify, and test.
> > > > > > > >
> > > > > > > >     Please vote in the next 72 hours.
> > > > > > > >
> > > > > > > >     [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > > > >     [ ] +0
> > > > > > > >     [ ] -1 Do not release this because...
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Thanks, Gabor, for the explanation. I'd like to change my vote to +1
(non-binding).

Cheers, Fokko

On Tue, Nov 19, 2019 at 18:03 Ryan Blue <rb...@netflix.com.invalid> wrote:

> Gabor, what I meant was: have we tried this with real data to see the
> effect? I think those results would be helpful.
>
> On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <ga...@apache.org>
> wrote:
>
> > Hi Ryan,
> >
> > It is not easy to calculate. For the column indexes feature we introduced
> > two new structures saved before the footer: column indexes and offset
> > indexes. If the min/max values are not too long, then the truncation
> might
> > not decrease the file size because of the offset indexes. Moreover, we
> also
> > introduced parquet.page.row.count.limit which might increase the number
> of
> > pages which leads to increasing the file size.
> > The footer itself is also changed and we are saving more values in it:
> the
> > offset values to the column/offset indexes, the new logical type
> > structures, the CRC checksums (we might have some others).
> > So, the size of the files with small amount of data will be increased
> > (because of the larger footer). The size of the files where the values
> can
> > be encoded very well (RLE) will probably be increased (because we will
> have
> > more pages). The size of some files where the values are long (>64bytes
> by
> > default) might be decreased because of truncating the min/max values.
> >
> > Regards,
> > Gabor
> >
> > On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >
> > > Gabor, do we have an idea of the additional overhead for a non-test
> data
> > > file? It should be easy to validate that this doesn't introduce an
> > > unreasonable amount of overhead. In some cases, it should actually be
> > > smaller since the column indexes are truncated and page stats are not.
> > >
> > > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > > <ga...@cloudera.com.invalid> wrote:
> > >
> > > > Hi Fokko,
> > > >
> > > > For the first point. The referenced constructor is private and
> Iceberg
> > > uses
> > > > it via reflection. It is not a breaking change. I think, parquet-mr
> > shall
> > > > not keep private methods only because of clients might use them via
> > > > reflection.
> > > >
> > > > About the checksum. I've agreed on having the CRC checksum write
> > enabled
> > > by
> > > > default because the benchmarks did not show significant performance
> > > > penalties. See https://github.com/apache/parquet-mr/pull/647 for
> > > details.
> > > >
> > > > About the file size change. 1.11.0 is introducing column indexes, CRC
> > > > checksum, removing the statistics from the page headers and maybe
> other
> > > > changes that impact file size. If only file size is in question I
> > cannot
> > > > see a breaking change here.
> > > >
> > > > Regards,
> > > > Gabor
> > > >
> > > >
> > > >
> > > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko
> <fokko@driesprong.frl
> > >
> > > > wrote:
> > > >
> > > > > Unfortunately, a -1 from my side (non-binding)
> > > > >
> > > > > I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > > >
> > > > >    - We've broken backward compatibility of the constructor of
> > > > >    ColumnChunkPageWriteStore
> > > > >    <
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > > > > >.
> > > > >    This required a change
> > > > >    <
> > > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > > > > >
> > > > >    to the code. This isn't a hard blocker, but if there will be a
> new
> > > RC,
> > > > > I've
> > > > >    submitted a patch:
> https://github.com/apache/parquet-mr/pull/699
> > > > >    - Related, that we need to put in the changelog, is that
> checksums
> > > are
> > > > >    enabled by default:
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > > This
> > > > >    will impact performance. I would suggest disabling it by
> default:
> > > > >    https://github.com/apache/parquet-mr/pull/700
> > > > >    <
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > > > > >
> > > > >    - Binary compatibility. While updating Iceberg, I've noticed
> that
> > > the
> > > > >    split-test was failing:
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > > The
> > > > >    two records are now divided over four Spark partitions.
> Something
> > in
> > > > the
> > > > >    output has changed since the files are bigger now. Has anyone
> any
> > > idea
> > > > > to
> > > > >    check what's changed, or a way to check this? The only thing I
> can
> > > > > think of
> > > > >    is the checksum mentioned above.
> > > > >
> > > > > $ ls -lah ~/Desktop/parquet-1-1*
> > > > > -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > >
> > > > > $ parquet-tools cat
> > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > > id = 1
> > > > > data = a
> > > > >
> > > > > $ parquet-tools cat
> > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > > id = 1
> > > > > data = a
> > > > >
> > > > > A binary diff here:
> > > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > > Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
> > > > chenjunjiedada@gmail.com
> > > > > >:
> > > > >
> > > > > > +1
> > > > > > Verified signature, checksum and ran mvn install successfully.
> > > > > >
> > > > > > Wang, Yuming <yu...@ebay.com.invalid> 于2019年11月14日周四 下午2:05写道:
> > > > > > >
> > > > > > > +1
> > > > > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> > > > "sql/test-only"
> > > > > > -Phadoop-3.2
> > > > > > >
> > > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
> > > wrote:
> > > > > > >
> > > > > > >     Hi everyone,
> > > > > > >
> > > > > > >     I propose the following RC to be released as official
> Apache
> > > > > Parquet
> > > > > > 1.11.0
> > > > > > >     release.
> > > > > > >
> > > > > > >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > > >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > > > >     *
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&amp;reserved=0
> > > > > > >
> > > > > > >     The release tarball, signature, and checksums are here:
> > > > > > >     *
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&amp;reserved=0
> > > > > > >
> > > > > > >     You can find the KEYS file here:
> > > > > > >     *
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&amp;reserved=0
> > > > > > >
> > > > > > >     Binary artifacts are staged in Nexus here:
> > > > > > >     *
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
> > > > > > >
> > > > > > >     This release includes the changes listed at:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
> > > > > > >
> > > > > > >     Please download, verify, and test.
> > > > > > >
> > > > > > >     Please vote in the next 72 hours.
> > > > > > >
> > > > > > >     [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > > >     [ ] +0
> > > > > > >     [ ] -1 Do not release this because...
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Gabor, what I meant was: have we tried this with real data to see the
effect? I think those results would be helpful.

On Mon, Nov 18, 2019 at 11:35 PM Gabor Szadovszky <ga...@apache.org> wrote:

> Hi Ryan,
>
> It is not easy to calculate. For the column indexes feature we introduced
> two new structures saved before the footer: column indexes and offset
> indexes. If the min/max values are not too long, then the truncation might
> not decrease the file size because of the offset indexes. Moreover, we also
> introduced parquet.page.row.count.limit which might increase the number of
> pages which leads to increasing the file size.
> The footer itself is also changed and we are saving more values in it: the
> offset values to the column/offset indexes, the new logical type
> structures, the CRC checksums (we might have some others).
> So, the size of the files with small amount of data will be increased
> (because of the larger footer). The size of the files where the values can
> be encoded very well (RLE) will probably be increased (because we will have
> more pages). The size of some files where the values are long (>64bytes by
> default) might be decreased because of truncating the min/max values.
>
> Regards,
> Gabor
>
> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > Gabor, do we have an idea of the additional overhead for a non-test data
> > file? It should be easy to validate that this doesn't introduce an
> > unreasonable amount of overhead. In some cases, it should actually be
> > smaller since the column indexes are truncated and page stats are not.
> >
> > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
> > <ga...@cloudera.com.invalid> wrote:
> >
> > > Hi Fokko,
> > >
> > > For the first point. The referenced constructor is private and Iceberg
> > uses
> > > it via reflection. It is not a breaking change. I think, parquet-mr
> shall
> > > not keep private methods only because of clients might use them via
> > > reflection.
> > >
> > > About the checksum. I've agreed on having the CRC checksum write
> enabled
> > by
> > > default because the benchmarks did not show significant performance
> > > penalties. See https://github.com/apache/parquet-mr/pull/647 for
> > details.
> > >
> > > About the file size change. 1.11.0 is introducing column indexes, CRC
> > > checksum, removing the statistics from the page headers and maybe other
> > > changes that impact file size. If only file size is in question I
> cannot
> > > see a breaking change here.
> > >
> > > Regards,
> > > Gabor
> > >
> > >
> > >
> > > On Sun, Nov 17, 2019 at 9:27 PM Driesprong, Fokko <fokko@driesprong.frl
> >
> > > wrote:
> > >
> > > > Unfortunately, a -1 from my side (non-binding)
> > > >
> > > > I've updated Iceberg to Parquet 1.11.0, and found three things:
> > > >
> > > >    - We've broken backward compatibility of the constructor of
> > > >    ColumnChunkPageWriteStore
> > > >    <
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80
> > > > >.
> > > >    This required a change
> > > >    <
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176
> > > > >
> > > >    to the code. This isn't a hard blocker, but if there will be a new
> > RC,
> > > > I've
> > > >    submitted a patch: https://github.com/apache/parquet-mr/pull/699
> > > >    - Related, that we need to put in the changelog, is that checksums
> > are
> > > >    enabled by default:
> > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
> > > > This
> > > >    will impact performance. I would suggest disabling it by default:
> > > >    https://github.com/apache/parquet-mr/pull/700
> > > >    <
> > > >
> > >
> >
> https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277
> > > > >
> > > >    - Binary compatibility. While updating Iceberg, I've noticed that
> > the
> > > >    split-test was failing:
> > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
> > > > The
> > > >    two records are now divided over four Spark partitions. Something
> in
> > > the
> > > >    output has changed since the files are bigger now. Has anyone any
> > idea
> > > > to
> > > >    check what's changed, or a way to check this? The only thing I can
> > > > think of
> > > >    is the checksum mentioned above.
> > > >
> > > > $ ls -lah ~/Desktop/parquet-1-1*
> > > > -rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
> > > > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > -rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
> > > > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > >
> > > > $ parquet-tools cat
> > /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
> > > > id = 1
> > > > data = a
> > > >
> > > > $ parquet-tools cat
> > /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
> > > > id = 1
> > > > data = a
> > > >
> > > > A binary diff here:
> > > > https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
> > > >
> > > > Cheers, Fokko
> > > >
> > > > Op za 16 nov. 2019 om 04:18 schreef Junjie Chen <
> > > chenjunjiedada@gmail.com
> > > > >:
> > > >
> > > > > +1
> > > > > Verified signature, checksum and ran mvn install successfully.
> > > > >
> > > > > Wang, Yuming <yu...@ebay.com.invalid> 于2019年11月14日周四 下午2:05写道:
> > > > > >
> > > > > > +1
> > > > > > Tested Parquet 1.11.0 with Spark SQL module: build/sbt
> > > "sql/test-only"
> > > > > -Phadoop-3.2
> > > > > >
> > > > > > On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org>
> > wrote:
> > > > > >
> > > > > >     Hi everyone,
> > > > > >
> > > > > >     I propose the following RC to be released as official Apache
> > > > Parquet
> > > > > 1.11.0
> > > > > >     release.
> > > > > >
> > > > > >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> > > > > >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
> > > > > >     *
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Ftree%2F18519eb8e059865652eee3ff0e8593f126701da4&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=ToLFrTB9lU%2FGzH6UpXwy7PAY7kaupbyKAgdghESCfgg%3D&amp;reserved=0
> > > > > >
> > > > > >     The release tarball, signature, and checksums are here:
> > > > > >     *
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdist.apache.org%2Frepos%2Fdist%2Fdev%2Fparquet%2Fapache-parquet-1.11.0-rc7&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=MPaHiYJT7ZcqreAYUkvDvZugthUhRPrySdXpN2ytT5k%3D&amp;reserved=0
> > > > > >
> > > > > >     You can find the KEYS file here:
> > > > > >     *
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapache.org%2Fdist%2Fparquet%2FKEYS&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=IwG4MUGsP2lVzlD4bwZUEPuEAPUg%2FHXRYtxc5CQupBM%3D&amp;reserved=0
> > > > > >
> > > > > >     Binary artifacts are staged in Nexus here:
> > > > > >     *
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Frepository.apache.org%2Fcontent%2Fgroups%2Fstaging%2Forg%2Fapache%2Fparquet%2F&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=lHtqLRQqQFwsyoaLSVaJuau5gxPKsCQFFVJaY8H0tZQ%3D&amp;reserved=0
> > > > > >
> > > > > >     This release includes the changes listed at:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fparquet-mr%2Fblob%2Fapache-parquet-1.11.0-rc7%2FCHANGES.md&amp;data=02%7C01%7Cyumwang%40ebay.com%7C8d588ca5855842a94bed08d7683e1221%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637092488114756267&amp;sdata=82BplI3bLAL6qArLHvVoYReZOk%2BboSP655rI8VX5Q5I%3D&amp;reserved=0
> > > > > >
> > > > > >     Please download, verify, and test.
> > > > > >
> > > > > >     Please vote in the next 72 hours.
> > > > > >
> > > > > >     [ ] +1 Release this as Apache Parquet 1.11.0
> > > > > >     [ ] +0
> > > > > >     [ ] -1 Do not release this because...
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Ismaël Mejía <ie...@gmail.com>.
Forgot to mention: also moving master to 1.12.0-SNAPSHOT would validate that
everything transitions correctly and that no one accidentally merges a PR
into the wrong branch.

On Tue, Nov 19, 2019 at 10:35 AM Ismaël Mejía <ie...@gmail.com> wrote:

> +1
>
> Downloaded release code, checked hashes/signatures, run full tests and
> installed locally with zero errors. Tested integration on a downstream
> project (Apache Beam) and no issues (Note that we don't use any of the new
> features yet).
>
> Gabor, can you please create a corresponding parquet-1.11.x branch. I
> expected to compare the release with the branch and tag but I found the
> branch is not present.
>
> Thanks,
> Ismaël
>
>
>
> On Tue, Nov 19, 2019 at 8:35 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
>> Hi Ryan,
>>
>> It is not easy to calculate. For the column indexes feature we introduced
>> two new structures saved before the footer: column indexes and offset
>> indexes. If the min/max values are not too long, then the truncation might
>> not decrease the file size because of the offset indexes. Moreover, we
>> also
>> introduced parquet.page.row.count.limit which might increase the number of
>> pages which leads to increasing the file size.
>> The footer itself is also changed and we are saving more values in it: the
>> offset values to the column/offset indexes, the new logical type
>> structures, the CRC checksums (we might have some others).
>> So, the size of the files with small amount of data will be increased
>> (because of the larger footer). The size of the files where the values can
>> be encoded very well (RLE) will probably be increased (because we will
>> have
>> more pages). The size of some files where the values are long (>64bytes by
>> default) might be decreased because of truncating the min/max values.
>>
>> Regards,
>> Gabor
>>
>> On Mon, Nov 18, 2019 at 6:46 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>> > Gabor, do we have an idea of the additional overhead for a non-test data
>> > file? It should be easy to validate that this doesn't introduce an
>> > unreasonable amount of overhead. In some cases, it should actually be
>> > smaller since the column indexes are truncated and page stats are not.
>> >
>> > On Mon, Nov 18, 2019 at 1:00 AM Gabor Szadovszky
>> > <ga...@cloudera.com.invalid> wrote:
>> >
>> > > Hi Fokko,
>> > >
>> > > For the first point. The referenced constructor is private and Iceberg
>> > uses
>> > > it via reflection. It is not a breaking change. I think, parquet-mr
>> shall
>> > > not keep private methods only because of clients might use them via
>> > > reflection.
>> > >
>> > > About the checksum. I've agreed on having the CRC checksum write
>> enabled
>> > by
>> > > default because the benchmarks did not show significant performance
>> > > penalties. See https://github.com/apache/parquet-mr/pull/647 for
>> > details.
>> > >
>> > > About the file size change. 1.11.0 is introducing column indexes, CRC
>> > > checksum, removing the statistics from the page headers and maybe
>> other
>> > > changes that impact file size. If only file size is in question I
>> cannot
>> > > see a breaking change here.
>> > >
>> > > Regards,
>> > > Gabor

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Not changing the version to 1.12 was also intentional. Until we have a
successful vote for 1.11.0, it is not released, and therefore we are still
working on 1.11. I'll upgrade the version to 1.12.0-SNAPSHOT after 1.11.0
is released.

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Ismaël,

Thanks for checking the release.

The branch was not created because we usually release the current master for
major/minor releases. Everything that was on master is part of the current RC.
That is what we agreed on at the last Parquet sync. We usually create
branches only for patch releases, because there we are adding only some
controlled fixes.
What would you like to compare? The current RC and the previous ones already
have their own tags. For the previous RCs you need to check the previous
voting mails.

Regards,
Gabor


Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Ismaël Mejía <ie...@gmail.com>.
+1

Downloaded the release code, checked hashes/signatures, ran the full tests and
installed locally with zero errors. Tested integration on a downstream
project (Apache Beam) with no issues (note that we don't use any of the new
features yet).

Gabor, can you please create a corresponding parquet-1.11.x branch? I
expected to compare the release with the branch and tag, but I found the
branch is not present.

Thanks,
Ismaël

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Ryan,

It is not easy to calculate. For the column indexes feature we introduced
two new structures saved before the footer: column indexes and offset
indexes. If the min/max values are not too long, then the truncation might
not decrease the file size, because of the offset indexes. Moreover, we also
introduced parquet.page.row.count.limit, which might increase the number of
pages, which in turn increases the file size.
The footer itself has also changed and we are saving more values in it: the
offsets of the column/offset indexes, the new logical type structures, and
the CRC checksums (there might be some others).
So, the size of files with a small amount of data will increase (because of
the larger footer). The size of files where the values can be encoded very
well (RLE) will probably increase (because we will have more pages). The
size of some files where the values are long (>64 bytes by default) might
decrease because of truncating the min/max values.
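
If file size matters more than index granularity, both knobs can be tuned.
A small sketch (the truncate-length key and the defaults are from memory,
so please verify them against the 1.11.0 sources):

import org.apache.hadoop.conf.Configuration;

public class TunePageAndIndexSizes {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Fewer, larger pages, closer to the pre-1.11 layout. The key is the
    // one mentioned above; the default (20000, as far as I remember) is an
    // assumption.
    conf.setInt("parquet.page.row.count.limit", 100000);

    // Truncate column index min/max values more aggressively than the
    // assumed 64-byte default to shrink the index structures.
    conf.setInt("parquet.columnindex.truncate.length", 16);

    System.out.println(conf.getInt("parquet.page.row.count.limit", -1));
  }
}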

Regards,
Gabor


Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Gabor, do we have an idea of the additional overhead for a non-test data
file? It should be easy to validate that this doesn't introduce an
unreasonable amount of overhead. In some cases, files should actually be
smaller, since the column index min/max values are truncated and page stats
are no longer written.
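
Something as blunt as writing the same rows with both versions and diffing
the file lengths would already tell us a lot. A trivial harness (plain
java.io, nothing Parquet-specific assumed):

import java.io.File;

public class CompareFileSizes {
  public static void main(String[] args) {
    // args[0]: file written with 1.10.1, args[1]: the same rows written
    // with 1.11.0 RC7.
    File old = new File(args[0]);
    File rc = new File(args[1]);
    long delta = rc.length() - old.length();
    System.out.printf("1.10.1: %d B, 1.11.0-rc7: %d B, delta: %+d B (%.2f%%)%n",
        old.length(), rc.length(), delta, 100.0 * delta / old.length());
  }
}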


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.
Hi Fokko,

For the first point: the referenced constructor is private and Iceberg uses
it via reflection. It is not a breaking change. I think parquet-mr
should not keep private methods only because clients might use them via
reflection.
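
To make the failure mode concrete, here is a minimal self-contained sketch
(the class and constructor below are made up for illustration; this is not
the real ColumnChunkPageWriteStore signature):

import java.lang.reflect.Constructor;

public class ReflectionBreakage {

  // Hypothetical stand-in for a library class whose private constructor
  // gains an extra parameter between releases.
  private static class WriteStore {
    private WriteStore(String codec) {
    }
  }

  public static void main(String[] args) throws Exception {
    // The caller pins one exact signature; as soon as the constructor
    // changes shape, getDeclaredConstructor throws NoSuchMethodException
    // at runtime. This is the kind of breakage reflection exposes you to.
    Constructor<WriteStore> ctor =
        WriteStore.class.getDeclaredConstructor(String.class);
    ctor.setAccessible(true);
    WriteStore store = ctor.newInstance("snappy");
    System.out.println("constructed " + store);
  }
}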

About the checksum: I agreed to having the CRC checksum write enabled
by default because the benchmarks did not show significant performance
penalties. See https://github.com/apache/parquet-mr/pull/647 for details.
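
For anyone who wants the pre-1.11 behaviour anyway, a minimal sketch of
opting out on the write path (the property key is my recollection of the
1.11.0 name; treat it as an assumption and verify it against the sources):

import org.apache.hadoop.conf.Configuration;

public class DisablePageChecksums {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed key behind the ParquetProperties default that Fokko linked.
    conf.setBoolean("parquet.page.write-checksum.enabled", false);

    // Writers created from this Configuration (e.g. through
    // ParquetOutputFormat) should then skip writing page-level CRCs.
    System.out.println(conf.getBoolean("parquet.page.write-checksum.enabled", true));
  }
}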

About the file size change: 1.11.0 introduces column indexes and the
CRC checksum, removes the statistics from the page headers, and may
include other changes that impact file size. If only the file size is
in question, I cannot see a breaking change here.

Regards,
Gabor

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Unfortunately, a -1 from my side (non-binding)

I've updated Iceberg to Parquet 1.11.0, and found three things:

   - We've broken backward compatibility of the constructor of
   ColumnChunkPageWriteStore
   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR80>.
   This required a change
   <https://github.com/apache/incubator-iceberg/pull/297/files#diff-b877faa96f292b851c75fe8bcc1912f8R176>
   to the code. This isn't a hard blocker, but if there will be a new RC, I've
   submitted a patch: https://github.com/apache/parquet-mr/pull/699
   - Related, that we need to put in the changelog, is that checksums are
   enabled by default:
   https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54
This
   will impact performance. I would suggest disabling it by default:
   https://github.com/apache/parquet-mr/pull/700
   <https://github.com/apache/parquet-mr/commit/e7db9e20f52c925a207ea62d6dda6dc4e870294e#diff-d007a18083a2431c30a5416f248e0a4bR277>
   - Binary compatibility. While updating Iceberg, I noticed that the
   split test was failing:
   https://github.com/apache/incubator-iceberg/pull/297/files#diff-4b64b7014f259be41b26cfb73d3e6e93L199
   The two records are now divided over four Spark partitions. Something in
   the output has changed, since the files are bigger now. Does anyone have
   an idea of what changed, or a way to check this? The only thing I can
   think of is the checksums mentioned above.
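
For the constructor issue, a reflection-based shim along these lines keeps a
downstream project working against both 1.10.x and 1.11.0. This is only a
minimal sketch: the extra int argument on the new constructor (a column-index
truncation length) and its position are my assumptions, not a verified
signature, and the linked patch takes a different approach. Please check it
against the RC before relying on it:

import java.lang.reflect.Constructor;

import org.apache.parquet.bytes.ByteBufferAllocator;
import org.apache.parquet.hadoop.CodecFactory.BytesCompressor;
import org.apache.parquet.schema.MessageType;

class PageWriteStoreCompat {

  // Try the (assumed) 1.11.0 constructor first, fall back to the 1.10.x one.
  static Object create(BytesCompressor compressor, MessageType schema,
                       ByteBufferAllocator allocator, int truncateLength)
      throws ReflectiveOperationException {
    Class<?> cls = Class.forName("org.apache.parquet.hadoop.ColumnChunkPageWriteStore");
    try {
      // Assumed 1.11.0 signature: trailing int truncation-length argument.
      Constructor<?> c = cls.getDeclaredConstructor(
          BytesCompressor.class, MessageType.class, ByteBufferAllocator.class, int.class);
      c.setAccessible(true); // the class is not public
      return c.newInstance(compressor, schema, allocator, truncateLength);
    } catch (NoSuchMethodException e) {
      // 1.10.x signature without the truncation length.
      Constructor<?> c = cls.getDeclaredConstructor(
          BytesCompressor.class, MessageType.class, ByteBufferAllocator.class);
      c.setAccessible(true);
      return c.newInstance(compressor, schema, allocator);
    }
  }
}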
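
And if checksums do stay enabled by default, downstream writers can still opt
out per job. A minimal sketch, assuming the property key and builder method
name I read out of the RC's ParquetProperties and ParquetOutputFormat (please
double-check the exact names):

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.ParquetProperties;

public class DisablePageChecksums {
  public static void main(String[] args) {
    // Hadoop-config route: property key as it appears in the RC (assumed).
    Configuration conf = new Configuration();
    conf.setBoolean("parquet.page.write-checksum.enabled", false);

    // Direct-API route: build ParquetProperties with checksums disabled.
    ParquetProperties props = ParquetProperties.builder()
        .withPageWriteChecksumEnabled(false)
        .build();
    System.out.println("checksums enabled: " + props.getPageWriteChecksumEnabled());
  }
}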

$ ls -lah ~/Desktop/parquet-1-1*
-rw-r--r--  1 fokkodriesprong  staff   562B 17 nov 21:09
/Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
-rw-r--r--  1 fokkodriesprong  staff   611B 17 nov 21:05
/Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet

$ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-10-1.parquet
id = 1
data = a

$ parquet-tools cat /Users/fokkodriesprong/Desktop/parquet-1-11-0.parquet
id = 1
data = a

A binary diff here:
https://gist.github.com/Fokko/1c209f158299dc2fb5878c5bae4bf6d8
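
To narrow down where the extra 49 bytes come from, diffing the two footers is
probably quicker than a binary diff (parquet-tools meta gives similar
output). A sketch against the public ParquetFileReader API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterDiff {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // e.g. java FooterDiff parquet-1-10-1.parquet parquet-1-11-0.parquet
    for (String file : args) {
      try (ParquetFileReader reader =
               ParquetFileReader.open(HadoopInputFile.fromPath(new Path(file), conf))) {
        ParquetMetadata footer = reader.getFooter();
        System.out.println(file + " createdBy=" + footer.getFileMetaData().getCreatedBy());
        // Print per-column-chunk sizes and encodings for a side-by-side diff.
        for (BlockMetaData block : footer.getBlocks()) {
          for (ColumnChunkMetaData column : block.getColumns()) {
            System.out.println("  " + column.getPath()
                + " totalSize=" + column.getTotalSize()
                + " encodings=" + column.getEncodings());
          }
        }
      }
    }
  }
}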

Cheers, Fokko

On Sat, 16 Nov 2019 at 04:18, Junjie Chen <ch...@gmail.com> wrote:

> +1
> Verified signature, checksum and ran mvn install successfully.
>
> Wang, Yuming <yu...@ebay.com.invalid> wrote on Thu, 14 Nov 2019 at 2:05 PM:
> >
> > +1
> > Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
> >
> > On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org> wrote:
> >
> >     Hi everyone,
> >
> >     I propose the following RC to be released as official Apache Parquet
> >     1.11.0 release.
> >
> >     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
> >     * This corresponds to the tag: apache-parquet-1.11.0-rc7
> >     *
> >     https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
> >
> >     The release tarball, signature, and checksums are here:
> >     * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
> >
> >     You can find the KEYS file here:
> >     * https://apache.org/dist/parquet/KEYS
> >
> >     Binary artifacts are staged in Nexus here:
> >     * https://repository.apache.org/content/groups/staging/org/apache/parquet/
> >
> >     This release includes the changes listed at:
> >     https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
> >
> >     Please download, verify, and test.
> >
> >     Please vote in the next 72 hours.
> >
> >     [ ] +1 Release this as Apache Parquet 1.11.0
> >     [ ] +0
> >     [ ] -1 Do not release this because...
> >
> >
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by Junjie Chen <ch...@gmail.com>.
+1
Verified signature, checksum and ran mvn install successfully.

Wang, Yuming <yu...@ebay.com.invalid> wrote on Thu, 14 Nov 2019 at 2:05 PM:
>
> +1
> Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2
>
> On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org> wrote:
>
>     Hi everyone,
>
>     I propose the following RC to be released as official Apache Parquet 1.11.0
>     release.
>
>     The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
>     * This corresponds to the tag: apache-parquet-1.11.0-rc7
>     *
>     https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
>
>     The release tarball, signature, and checksums are here:
>     * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
>
>     You can find the KEYS file here:
>     * https://apache.org/dist/parquet/KEYS
>
>     Binary artifacts are staged in Nexus here:
>     * https://repository.apache.org/content/groups/staging/org/apache/parquet/
>
>     This release includes the changes listed at:
>     https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
>
>     Please download, verify, and test.
>
>     Please vote in the next 72 hours.
>
>     [ ] +1 Release this as Apache Parquet 1.11.0
>     [ ] +0
>     [ ] -1 Do not release this because...
>
>

Re: [VOTE] Release Apache Parquet 1.11.0 RC7

Posted by "Wang, Yuming" <yu...@ebay.com.INVALID>.
+1 
Tested Parquet 1.11.0 with Spark SQL module: build/sbt "sql/test-only" -Phadoop-3.2

On 2019/11/13, 21:33, "Gabor Szadovszky" <ga...@apache.org> wrote:

    Hi everyone,
    
    I propose the following RC to be released as official Apache Parquet 1.11.0
    release.
    
    The commit id is 18519eb8e059865652eee3ff0e8593f126701da4
    * This corresponds to the tag: apache-parquet-1.11.0-rc7
    *
    https://github.com/apache/parquet-mr/tree/18519eb8e059865652eee3ff0e8593f126701da4
    
    The release tarball, signature, and checksums are here:
    * https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-1.11.0-rc7
    
    You can find the KEYS file here:
    * https://apache.org/dist/parquet/KEYS
    
    Binary artifacts are staged in Nexus here:
    * https://repository.apache.org/content/groups/staging/org/apache/parquet/
    
    This release includes the changes listed at:
    https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0-rc7/CHANGES.md
    
    Please download, verify, and test.
    
    Please vote in the next 72 hours.
    
    [ ] +1 Release this as Apache Parquet 1.11.0
    [ ] +0
    [ ] -1 Do not release this because...