Posted to dev@arrow.apache.org by Renjie Liu <li...@gmail.com> on 2019/10/09 03:48:07 UTC

[DISCUSS] Proposal about integration test of arrow parquet reader

Hi:

I'm developing the Rust version of the reader which reads Parquet into Arrow arrays.
To verify the correctness of this reader, I use the following approach:


   1. Define the schema with protobuf.
   2. Generate JSON data for this schema using another language with a more
   sophisticated implementation (e.g. Java).
   3. Generate Parquet data for this schema using another language with a
   more sophisticated implementation (e.g. Java).
   4. Write tests that read the JSON file and the Parquet file into memory
   (Arrow arrays), then compare the JSON data with the Arrow data.

 I think with this method we can verify the correctness of the Arrow reader
because the JSON format is ubiquitous and its implementations are more stable.

Any comment is appreciated.

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Renjie Liu <li...@gmail.com>.
Currently, for the Rust version of the Parquet reader alone, some static
files covering a few types would be enough.
However, I agree with Wes that we should not rely on static binary files
for functional tests because they are hard to maintain as Arrow evolves.
For example, the Parquet reader in C++ currently doesn't support nested
types, so a Parquet file without nested types may be easier to test
against. But once it supports nested types, we would need to change the
files again.
I think we should set up a program which can generate binary files on
demand in every test run, and which could be used by Arrow libraries
written in different languages. I'll write a proposal about this in a
Google doc.
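Such an on-demand generator stays debuggable if it is seeded, matching the
fixed-seed point Wes raises in the quoted discussion below. A minimal
stdlib-only Python sketch (the schema shown here is hypothetical):

```python
import random

def generate_test_records(seed, num_rows):
    """Generate a deterministic batch of test records: the same seed
    always yields the same rows, so any failure is reproducible across
    CI runs and across language implementations."""
    rng = random.Random(seed)  # isolated RNG, independent of global state
    records = []
    for i in range(num_rows):
        records.append({
            "id": i,
            "value": rng.randint(-1000, 1000),
            # exercise nullability: roughly 10% of names are null
            "name": None if rng.random() < 0.1 else f"row-{i}",
        })
    return records
```

Each language binding would serialize the same records to Parquet (and
JSON) at test time, instead of checking binary files into the repository.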

On Sun, Oct 13, 2019 at 4:25 AM Wes McKinney <we...@gmail.com> wrote:

> I think the ideal scenario is to have a mix of "endogenous" unit
> testing and functional testing against real files to test for
> regressions or cross-compatibility. To criticize the work we've done
> in the C++ project, we have not done enough systematic integration
> testing IMHO, but we do test against some "bad files" that have
> accumulated.
>
> In any case, I think it's bad practice for a file format reader to
> rely exclusively on functional testing against static binary files.
>
> This could be a good opportunity to devise a language-agnostic Parquet
> integration testing strategy. Given that we're looking to add nested
> data support in C++ hopefully by the end of 2020, it would be good
> timing.
>
> On Sat, Oct 12, 2019 at 11:12 AM Andy Grove <an...@gmail.com> wrote:
> >
> > I also think that there are valid use cases for checking in binary files,
> > but we have to be careful not to abuse this. For example, we might want
> to
> > check in a Parquet file created by a particular version of Apache Spark
> to
> > ensure that Arrow implementations can read it successfully (hypothetical
> > example).
> >
> > It would also be good to have a small set of Parquet files using every
> > possible data type that all implementations can use in their tests. I
> > suppose we might want one set per Arrow format version as well.
> >
> > The problem we have now, in my opinion, is that we're proposing adding
> > files on a pretty ad-hoc basis, driven by the needs of individual
> > contributors in one language implementation, and this is perhaps
> happening
> > because we don't already have a good set of standard test files.
> >
> > Renjie - perhaps you could comment on this. If we had these standard
> files
> > covering all data types, would that have worked for you in this instance?
> >
> > Thanks,
> >
> > Andy.
> >
> > On Sat, Oct 12, 2019 at 12:03 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > Hi Wes,
> > > >
> > > > I additionally would prefer generating the test corpus at test time
> > > > rather than checking in binary files.
> > >
> > >
> > > Can you elaborate on this? I think both generated on the fly and
> example
> > > files are useful.
> > >
> > > The checked in files catch regressions even when readers/writers can
> read
> > > their own data but they have either incorrect or undefined behavior in
> > > regards to the specification (for example I would imagine checking in a
> > > file as part of the fix for ARROW-6844
> > > <https://issues.apache.org/jira/browse/ARROW-6844>).
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <li...@gmail.com>
> > > wrote:
> > >
> > > > Thanks Wes. Sure, I'll fix it.
> > > >
> > > > Wes McKinney <we...@gmail.com> 于 2019年10月11日周五 上午6:10写道:
> > > >
> > > > > I just merged the PR
> https://github.com/apache/arrow-testing/pull/11
> > > > >
> > > > > Various aspects of this make me uncomfortable so I hope they can be
> > > > > addressed in follow up work
> > > > >
> > > > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <
> liurenjie2008@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > I've created a ticket to track this here:
> > > > > > https://issues.apache.org/jira/browse/ARROW-6845
> > > > > >
> > > > > > For this moment, can we check in those pregenerated data to
> unblock
> > > > rust
> > > > > > version's arrow reader?
> > > > > >
> > > > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <
> liurenjie2008@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > It would be fine in that case.
> > > > > > >
> > > > > > > Wes McKinney <we...@gmail.com> 于 2019年10月10日周四 下午12:58写道:
> > > > > > >
> > > > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <
> > > liurenjie2008@gmail.com
> > > > >
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > 1. There already exists a low-level Parquet writer which
> > > > > > >> > can produce Parquet files, so unit tests should be fine.
> > > > > > >> > But a writer from Arrow to Parquet doesn't exist yet, and
> > > > > > >> > it may take some time to finish it.
> > > > > > >> > 2. In fact my data are randomly generated and it's
> > > > > > >> > definitely reproducible. However, I don't think it would be
> > > > > > >> > a good idea to randomly generate data every time we run CI
> > > > > > >> > because it would be difficult to debug. For example, if PR A
> > > > > > >> > introduced a bug that is triggered in another PR's build, it
> > > > > > >> > would be confusing for contributors.
> > > > > > >>
> > > > > > >> Presumably any random data generation would use a fixed seed
> > > > precisely
> > > > > > >> to be reproducible.
> > > > > > >>
> > > > > > >> > 3. I think it would be a good idea to spend effort on
> > > > > > >> > integration tests with Parquet because it's an important
> > > > > > >> > use case of Arrow. A similar approach could also be extended
> > > > > > >> > to other languages and other file formats (Avro, ORC).
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <
> > > wesmckinn@gmail.com
> > > > >
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > There are a number of issues worth discussion.
> > > > > > >> > >
> > > > > > >> > > 1. What is the timeline/plan for Rust implementing a
> Parquet
> > > > > _writer_?
> > > > > > >> > > It's OK to be reliant on other libraries in the short
> term to
> > > > > produce
> > > > > > >> > > files to test against, but does not strike me as a
> sustainable
> > > > > > >> > > long-term plan. Fixing bugs can be a lot more difficult
> than
> > > it
> > > > > needs
> > > > > > >> > > to be if you can't write targeted "endogenous" unit tests
> > > > > > >> > >
> > > > > > >> > > 2. Reproducible data generation
> > > > > > >> > >
> > > > > > >> > > I think if you're going to test against a pre-generated
> > > corpus,
> > > > > you
> > > > > > >> > > should make sure that generating the corpus is
> reproducible
> > > for
> > > > > other
> > > > > > >> > > developers (i.e. with a Dockerfile), and can be extended
> by
> > > > > adding new
> > > > > > >> > > files or random data generation.
> > > > > > >> > >
> > > > > > >> > > I additionally would prefer generating the test corpus at
> test
> > > > > time
> > > > > > >> > > rather than checking in binary files. If this isn't viable
> > > right
> > > > > now
> > > > > > >> > > we can create an "arrow-rust-crutch" git repository for
> you to
> > > > > stash
> > > > > > >> > > binary files until some of these testing scalability
> issues
> > > are
> > > > > > >> > > addressed.
> > > > > > >> > >
> > > > > > >> > > If we're going to spend energy on Parquet integration
> testing
> > > > with
> > > > > > >> > > Java, this would be a good opportunity to do the work in
> a way
> > > > > where
> > > > > > >> > > the C++ Parquet library can also participate (since we
> ought
> > > to
> > > > be
> > > > > > >> > > doing integration tests with Java, and we can also read
> JSON
> > > > > files to
> > > > > > >> > > Arrow).
> > > > > > >> > >
> > > > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <
> > > > > liurenjie2008@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <
> > > > > andygrove73@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > I'm very interested in helping to find a solution to
> this
> > > > > because
> > > > > > >> we
> > > > > > >> > > really
> > > > > > >> > > > > do need integration tests for Rust to make sure we're
> > > > > compatible
> > > > > > >> with
> > > > > > >> > > other
> > > > > > >> > > > > implementations... there is also the ongoing CI
> > > > dockerization
> > > > > work
> > > > > > >> > > that I
> > > > > > >> > > > > feel is related.
> > > > > > >> > > > >
> > > > > > >> > > > > I haven't looked at the current integration tests yet
> and
> > > > > would
> > > > > > >> > > appreciate
> > > > > > >> > > > > some pointers on how all of this works (do we have
> docs?)
> > > or
> > > > > > >> where to
> > > > > > >> > > start
> > > > > > >> > > > > looking.
> > > > > > >> > > > >
> > > > > > >> > > > I have a test in my latest PR:
> > > > > > >> https://github.com/apache/arrow/pull/5523
> > > > > > >> > > > And here is the generated data:
> > > > > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > > > > >> > > > As for the program that generates this data, it's just
> > > > > > >> > > > a simple Java program. I'm not sure whether we need to
> > > > > > >> > > > integrate it into Arrow.
> > > > > > >> > > >
> > > > > > >> > > > >
> > > > > > >> > > > > I imagine the integration test could follow the
> approach
> > > > that
> > > > > > >> Renjie is
> > > > > > >> > > > > outlining where we call Java to generate some files
> and
> > > then
> > > > > call
> > > > > > >> Rust
> > > > > > >> > > to
> > > > > > >> > > > > parse them?
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > >
> > > > > > >> > > > > Andy.
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
> > > > > > >> liurenjie2008@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Hi:
> > > > > > >> > > > > >
> > > > > > >> > > > > > I'm developing rust version of reader which reads
> > > parquet
> > > > > into
> > > > > > >> arrow
> > > > > > >> > > > > array.
> > > > > > >> > > > > > To verify the correct of this reader, I use the
> > > following
> > > > > > >> approach:
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >    1. Define schema with protobuf.
> > > > > > >> > > > > >    2. Generate json data of this schema using other
> > > > language
> > > > > > >> with
> > > > > > >> > > more
> > > > > > >> > > > > >    sophisticated implementation (e.g. java)
> > > > > > >> > > > > >    3. Generate parquet data of this schema using
> other
> > > > > language
> > > > > > >> with
> > > > > > >> > > more
> > > > > > >> > > > > >    sophisticated implementation (e.g. java)
> > > > > > >> > > > > >    4. Write tests to read json file, and parquet
> file
> > > into
> > > > > > >> memory
> > > > > > >> > > (arrow
> > > > > > >> > > > > >    array), then compare json data with arrow data.
> > > > > > >> > > > > >
> > > > > > >> > > > > >  I think with this method we can guarantee the
> > > correctness
> > > > > of
> > > > > > >> arrow
> > > > > > >> > > > > reader
> > > > > > >> > > > > > because json format is ubiquitous and their
> > > implementation
> > > > > are
> > > > > > >> more
> > > > > > >> > > > > stable.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Any comment is appreciated.
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > --
> > > > > > >> > > > Renjie Liu
> > > > > > >> > > > Software Engineer, MVAD
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > Renjie Liu
> > > > > > >> > Software Engineer, MVAD
> > > > > > >>
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Renjie Liu
> > > > > > Software Engineer, MVAD
> > > > >
> > > >
> > >
>


-- 
Renjie Liu
Software Engineer, MVAD

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Wes McKinney <we...@gmail.com>.
I think the ideal scenario is to have a mix of "endogenous" unit
testing and functional testing against real files to test for
regressions or cross-compatibility. To criticize the work we've done
in the C++ project, we have not done enough systematic integration
testing IMHO, but we do test against some "bad files" that have
accumulated.

In any case, I think it's bad practice for a file format reader to
rely exclusively on functional testing against static binary files.

This good be a good opportunity to devise a language-agnostic Parquet
integration testing strategy. Given that we're looking to add nested
data support in C++ hopefully by the end of 2020, it would be good
timing.

On Sat, Oct 12, 2019 at 11:12 AM Andy Grove <an...@gmail.com> wrote:
>
> I also think that there are valid use cases for checking in binary files,
> but we have to be careful not to abuse this. For example, we might want to
> check in a Parquet file created by a particular version of Apache Spark to
> ensure that Arrow implementations can read it successfully (hypothetical
> example).
>
> It would also be good to have a small set of Parquet files using every
> possible data type that all implementations can use in their tests. I
> suppose we might want one set per Arrow format version as well.
>
> The problem we have now, in my opinion, is that we're proposing adding
> files on a pretty ad-hoc basis, driven by the needs of individual
> contributors in one language implementation, and this is perhaps happening
> because we don't already have a good set of standard test files.
>
> Renjie - perhaps you could comment on this. If we had these standard files
> covering all data types, would that have worked for you in this instance?
>
> Thanks,
>
> Andy.
>
> On Sat, Oct 12, 2019 at 12:03 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > Hi Wes,
> > >
> > > I additionally would prefer generating the test corpus at test time
> > > rather than checking in binary files.
> >
> >
> > Can you elaborate on this? I think both generated on the fly and example
> > files are useful.
> >
> > The checked in files catch regressions even when readers/writers can read
> > their own data but they have either incorrect or undefined behavior in
> > regards to the specification (for example I would imagine checking in a
> > file as part of the fix for ARROW-6844
> > <https://issues.apache.org/jira/browse/ARROW-6844>).
> >
> > Thanks,
> > Micah
> >
> > On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <li...@gmail.com>
> > wrote:
> >
> > > Thanks wes. Sure I'll fix it.
> > >
> > > Wes McKinney <we...@gmail.com> 于 2019年10月11日周五 上午6:10写道:
> > >
> > > > I just merged the PR https://github.com/apache/arrow-testing/pull/11
> > > >
> > > > Various aspects of this make me uncomfortable so I hope they can be
> > > > addressed in follow up work
> > > >
> > > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <li...@gmail.com>
> > > > wrote:
> > > > >
> > > > > I've create ticket to track here:
> > > > > https://issues.apache.org/jira/browse/ARROW-6845
> > > > >
> > > > > For this moment, can we check in those pregenerated data to unblock
> > > rust
> > > > > version's arrow reader?
> > > > >
> > > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <li...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > It would be fine in that case.
> > > > > >
> > > > > > Wes McKinney <we...@gmail.com> 于 2019年10月10日周四 下午12:58写道:
> > > > > >
> > > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <
> > liurenjie2008@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > 1. There already exists a low level parquet writer which can
> > > produce
> > > > > >> > parquet file, so unit test should be fine. But writer from arrow
> > > to
> > > > > >> parquet
> > > > > >> > doesn't exist yet, and it may take some period of time to finish
> > > it.
> > > > > >> > 2. In fact my data are randomly generated and it's definitely
> > > > > >> reproducible.
> > > > > >> > However, I don't think it would be good idea to randomly
> > generate
> > > > data
> > > > > >> > everytime we run ci because it would be difficult to debug. For
> > > > example
> > > > > >> PR
> > > > > >> > a introduced a bug, which is triggerred in other PR's build it
> > > > would be
> > > > > >> > confusing for contributors.
> > > > > >>
> > > > > >> Presumably any random data generation would use a fixed seed
> > > precisely
> > > > > >> to be reproducible.
> > > > > >>
> > > > > >> > 3. I think it would be good idea to spend effort on integration
> > > test
> > > > > >> with
> > > > > >> > parquet because it's an important use case of arrow. Also
> > similar
> > > > > >> approach
> > > > > >> > could be extended to other language and other file format(avro,
> > > > orc).
> > > > > >> >
> > > > > >> >
> > > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <
> > wesmckinn@gmail.com
> > > >
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > There are a number of issues worth discussion.
> > > > > >> > >
> > > > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> > > > _writer_?
> > > > > >> > > It's OK to be reliant on other libraries in the short term to
> > > > produce
> > > > > >> > > files to test against, but does not strike me as a sustainable
> > > > > >> > > long-term plan. Fixing bugs can be a lot more difficult than
> > it
> > > > needs
> > > > > >> > > to be if you can't write targeted "endogenous" unit tests
> > > > > >> > >
> > > > > >> > > 2. Reproducible data generation
> > > > > >> > >
> > > > > >> > > I think if you're going to test against a pre-generated
> > corpus,
> > > > you
> > > > > >> > > should make sure that generating the corpus is reproducible
> > for
> > > > other
> > > > > >> > > developers (i.e. with a Dockerfile), and can be extended by
> > > > adding new
> > > > > >> > > files or random data generation.
> > > > > >> > >
> > > > > >> > > I additionally would prefer generating the test corpus at test
> > > > time
> > > > > >> > > rather than checking in binary files. If this isn't viable
> > right
> > > > now
> > > > > >> > > we can create an "arrow-rust-crutch" git repository for you to
> > > > stash
> > > > > >> > > binary files until some of these testing scalability issues
> > are
> > > > > >> > > addressed.
> > > > > >> > >
> > > > > >> > > If we're going to spend energy on Parquet integration testing
> > > with
> > > > > >> > > Java, this would be a good opportunity to do the work in a way
> > > > where
> > > > > >> > > the C++ Parquet library can also participate (since we ought
> > to
> > > be
> > > > > >> > > doing integration tests with Java, and we can also read JSON
> > > > files to
> > > > > >> > > Arrow).
> > > > > >> > >
> > > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <
> > > > liurenjie2008@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <
> > > > andygrove73@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > > I'm very interested in helping to find a solution to this
> > > > because
> > > > > >> we
> > > > > >> > > really
> > > > > >> > > > > do need integration tests for Rust to make sure we're
> > > > compatible
> > > > > >> with
> > > > > >> > > other
> > > > > >> > > > > implementations... there is also the ongoing CI
> > > dockerization
> > > > work
> > > > > >> > > that I
> > > > > >> > > > > feel is related.
> > > > > >> > > > >
> > > > > >> > > > > I haven't looked at the current integration tests yet and
> > > > would
> > > > > >> > > appreciate
> > > > > >> > > > > some pointers on how all of this works (do we have docs?)
> > or
> > > > > >> where to
> > > > > >> > > start
> > > > > >> > > > > looking.
> > > > > >> > > > >
> > > > > >> > > > I have a test in my latest PR:
> > > > > >> https://github.com/apache/arrow/pull/5523
> > > > > >> > > > And here is the generated data:
> > > > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > > > >> > > > As with program to generate these data, it's just a simple
> > > java
> > > > > >> program.
> > > > > >> > > > I'm not sure whether we need to integrate it into arrow.
> > > > > >> > > >
> > > > > >> > > > >
> > > > > >> > > > > I imagine the integration test could follow the approach
> > > that
> > > > > >> Renjie is
> > > > > >> > > > > outlining where we call Java to generate some files and
> > then
> > > > call
> > > > > >> Rust
> > > > > >> > > to
> > > > > >> > > > > parse them?
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Andy.
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
> > > > > >> liurenjie2008@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi:
> > > > > >> > > > > >
> > > > > >> > > > > > I'm developing rust version of reader which reads
> > parquet
> > > > into
> > > > > >> arrow
> > > > > >> > > > > array.
> > > > > >> > > > > > To verify the correct of this reader, I use the
> > following
> > > > > >> approach:
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >    1. Define schema with protobuf.
> > > > > >> > > > > >    2. Generate json data of this schema using other
> > > language
> > > > > >> with
> > > > > >> > > more
> > > > > >> > > > > >    sophisticated implementation (e.g. java)
> > > > > >> > > > > >    3. Generate parquet data of this schema using other
> > > > language
> > > > > >> with
> > > > > >> > > more
> > > > > >> > > > > >    sophisticated implementation (e.g. java)
> > > > > >> > > > > >    4. Write tests to read json file, and parquet file
> > into
> > > > > >> memory
> > > > > >> > > (arrow
> > > > > >> > > > > >    array), then compare json data with arrow data.
> > > > > >> > > > > >
> > > > > >> > > > > >  I think with this method we can guarantee the
> > correctness
> > > > of
> > > > > >> arrow
> > > > > >> > > > > reader
> > > > > >> > > > > > because json format is ubiquitous and their
> > implementation
> > > > are
> > > > > >> more
> > > > > >> > > > > stable.
> > > > > >> > > > > >
> > > > > >> > > > > > Any comment is appreciated.
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > --
> > > > > >> > > > Renjie Liu
> > > > > >> > > > Software Engineer, MVAD
> > > > > >> > >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > Renjie Liu
> > > > > >> > Software Engineer, MVAD
> > > > > >>
> > > > > >
> > > > >
> > > > > --
> > > > > Renjie Liu
> > > > > Software Engineer, MVAD
> > > >
> > >
> >

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Andy Grove <an...@gmail.com>.
I also think that there are valid use cases for checking in binary files,
but we have to be careful not to abuse this. For example, we might want to
check in a Parquet file created by a particular version of Apache Spark to
ensure that Arrow implementations can read it successfully (hypothetical
example).

It would also be good to have a small set of Parquet files using every
possible data type that all implementations can use in their tests. I
suppose we might want one set per Arrow format version as well.

The problem we have now, in my opinion, is that we're proposing adding
files on a pretty ad-hoc basis, driven by the needs of individual
contributors in one language implementation, and this is perhaps happening
because we don't already have a good set of standard test files.

Renjie - perhaps you could comment on this. If we had these standard files
covering all data types, would that have worked for you in this instance?

Thanks,

Andy.

On Sat, Oct 12, 2019 at 12:03 AM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Wes,
> >
> > I additionally would prefer generating the test corpus at test time
> > rather than checking in binary files.
>
>
> Can you elaborate on this? I think both generated on the fly and example
> files are useful.
>
> The checked in files catch regressions even when readers/writers can read
> their own data but they have either incorrect or undefined behavior in
> regards to the specification (for example I would imagine checking in a
> file as part of the fix for ARROW-6844
> <https://issues.apache.org/jira/browse/ARROW-6844>).
>
> Thanks,
> Micah
>
> On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <li...@gmail.com>
> wrote:
>
> > Thanks wes. Sure I'll fix it.
> >
> > Wes McKinney <we...@gmail.com> 于 2019年10月11日周五 上午6:10写道:
> >
> > > I just merged the PR https://github.com/apache/arrow-testing/pull/11
> > >
> > > Various aspects of this make me uncomfortable so I hope they can be
> > > addressed in follow up work
> > >
> > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <li...@gmail.com>
> > > wrote:
> > > >
> > > > I've create ticket to track here:
> > > > https://issues.apache.org/jira/browse/ARROW-6845
> > > >
> > > > For this moment, can we check in those pregenerated data to unblock
> > rust
> > > > version's arrow reader?
> > > >
> > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <li...@gmail.com>
> > > wrote:
> > > >
> > > > > It would be fine in that case.
> > > > >
> > > > > Wes McKinney <we...@gmail.com> 于 2019年10月10日周四 下午12:58写道:
> > > > >
> > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <
> liurenjie2008@gmail.com
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> > 1. There already exists a low level parquet writer which can
> > produce
> > > > >> > parquet file, so unit test should be fine. But writer from arrow
> > to
> > > > >> parquet
> > > > >> > doesn't exist yet, and it may take some period of time to finish
> > it.
> > > > >> > 2. In fact my data are randomly generated and it's definitely
> > > > >> reproducible.
> > > > >> > However, I don't think it would be good idea to randomly
> generate
> > > data
> > > > >> > everytime we run ci because it would be difficult to debug. For
> > > example
> > > > >> PR
> > > > >> > a introduced a bug, which is triggerred in other PR's build it
> > > would be
> > > > >> > confusing for contributors.
> > > > >>
> > > > >> Presumably any random data generation would use a fixed seed
> > precisely
> > > > >> to be reproducible.
> > > > >>
> > > > >> > 3. I think it would be good idea to spend effort on integration
> > test
> > > > >> with
> > > > >> > parquet because it's an important use case of arrow. Also
> similar
> > > > >> approach
> > > > >> > could be extended to other language and other file format(avro,
> > > orc).
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <
> wesmckinn@gmail.com
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> > > There are a number of issues worth discussion.
> > > > >> > >
> > > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> > > _writer_?
> > > > >> > > It's OK to be reliant on other libraries in the short term to
> > > produce
> > > > >> > > files to test against, but does not strike me as a sustainable
> > > > >> > > long-term plan. Fixing bugs can be a lot more difficult than
> it
> > > needs
> > > > >> > > to be if you can't write targeted "endogenous" unit tests
> > > > >> > >
> > > > >> > > 2. Reproducible data generation
> > > > >> > >
> > > > >> > > I think if you're going to test against a pre-generated
> corpus,
> > > you
> > > > >> > > should make sure that generating the corpus is reproducible
> for
> > > other
> > > > >> > > developers (i.e. with a Dockerfile), and can be extended by
> > > adding new
> > > > >> > > files or random data generation.
> > > > >> > >
> > > > >> > > I additionally would prefer generating the test corpus at test
> > > time
> > > > >> > > rather than checking in binary files. If this isn't viable
> right
> > > now
> > > > >> > > we can create an "arrow-rust-crutch" git repository for you to
> > > stash
> > > > >> > > binary files until some of these testing scalability issues
> are
> > > > >> > > addressed.
> > > > >> > >
> > > > >> > > If we're going to spend energy on Parquet integration testing
> > with
> > > > >> > > Java, this would be a good opportunity to do the work in a way
> > > where
> > > > >> > > the C++ Parquet library can also participate (since we ought
> to
> > be
> > > > >> > > doing integration tests with Java, and we can also read JSON
> > > files to
> > > > >> > > Arrow).
> > > > >> > >
> > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <
> > > liurenjie2008@gmail.com>
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <
> > > andygrove73@gmail.com>
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > > I'm very interested in helping to find a solution to this
> > > because
> > > > >> we
> > > > >> > > really
> > > > >> > > > > do need integration tests for Rust to make sure we're
> > > compatible
> > > > >> with
> > > > >> > > other
> > > > >> > > > > implementations... there is also the ongoing CI
> > dockerization
> > > work
> > > > >> > > that I
> > > > >> > > > > feel is related.
> > > > >> > > > >
> > > > >> > > > > I haven't looked at the current integration tests yet and
> > > would
> > > > >> > > appreciate
> > > > >> > > > > some pointers on how all of this works (do we have docs?)
> or
> > > > >> where to
> > > > >> > > start
> > > > >> > > > > looking.
> > > > >> > > > >
> > > > >> > > > I have a test in my latest PR:
> > > > >> https://github.com/apache/arrow/pull/5523
> > > > >> > > > And here is the generated data:
> > > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > > >> > > > As with program to generate these data, it's just a simple
> > java
> > > > >> program.
> > > > >> > > > I'm not sure whether we need to integrate it into arrow.
> > > > >> > > >
> > > > >> > > > >
> > > > >> > > > > I imagine the integration test could follow the approach
> > that
> > > > >> Renjie is
> > > > >> > > > > outlining where we call Java to generate some files and
> then
> > > call
> > > > >> Rust
> > > > >> > > to
> > > > >> > > > > parse them?
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > >
> > > > >> > > > > Andy.
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
> > > > >> liurenjie2008@gmail.com>
> > > > >> > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi:
> > > > >> > > > > >
> > > > >> > > > > > I'm developing rust version of reader which reads
> parquet
> > > into
> > > > >> arrow
> > > > >> > > > > array.
> > > > >> > > > > > To verify the correct of this reader, I use the
> following
> > > > >> approach:
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > >    1. Define schema with protobuf.
> > > > >> > > > > >    2. Generate json data of this schema using other
> > language
> > > > >> with
> > > > >> > > more
> > > > >> > > > > >    sophisticated implementation (e.g. java)
> > > > >> > > > > >    3. Generate parquet data of this schema using other
> > > language
> > > > >> with
> > > > >> > > more
> > > > >> > > > > >    sophisticated implementation (e.g. java)
> > > > >> > > > > >    4. Write tests to read json file, and parquet file
> into
> > > > >> memory
> > > > >> > > (arrow
> > > > >> > > > > >    array), then compare json data with arrow data.
> > > > >> > > > > >
> > > > >> > > > > >  I think with this method we can guarantee the
> correctness
> > > of
> > > > >> arrow
> > > > >> > > > > reader
> > > > >> > > > > > because json format is ubiquitous and their
> implementation
> > > are
> > > > >> more
> > > > >> > > > > stable.
> > > > >> > > > > >
> > > > >> > > > > > Any comment is appreciated.
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > > Renjie Liu
> > > > >> > > > Software Engineer, MVAD
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Renjie Liu
> > > > >> > Software Engineer, MVAD
> > > > >>
> > > > >
> > > >
> > > > --
> > > > Renjie Liu
> > > > Software Engineer, MVAD
> > >
> >
>

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Micah Kornfield <em...@gmail.com>.
Hi Wes,
>
> I additionally would prefer generating the test corpus at test time
> rather than checking in binary files.


Can you elaborate on this? I think both on-the-fly generation and checked-in
example files are useful.

The checked-in files catch regressions even when readers/writers can read
their own data but have incorrect or undefined behavior with respect to the
specification (for example, I would imagine checking in a file as part of the
fix for ARROW-6844
<https://issues.apache.org/jira/browse/ARROW-6844>).
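
The golden-file point above can be illustrated with a toy regression test.
This is a hedged, std-only sketch: the byte format (little-endian i32s), the
file name, and `decode_le_i32` are invented stand-ins for a real Parquet
reader and a checked-in file from apache/arrow-testing.

```rust
use std::fs;

// Stand-in for a real Parquet decoder: here the "format" is just a run of
// little-endian i32s, purely for illustration.
fn decode_le_i32(bytes: &[u8]) -> Vec<i32> {
    bytes
        .chunks_exact(4)
        .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

fn main() {
    // In a real regression test the file would be checked in (e.g. under
    // apache/arrow-testing); here we write it on the fly so the sketch is
    // self-contained.
    let path = std::env::temp_dir().join("golden_example.bin");
    let golden: Vec<i32> = vec![1, -2, 3];
    let bytes: Vec<u8> = golden.iter().flat_map(|v| v.to_le_bytes()).collect();
    fs::write(&path, &bytes).unwrap();

    // Because the expected values are pinned alongside the file, a reader
    // whose behavior drifts fails here even if it can still round-trip its
    // own (incorrect) output.
    let decoded = decode_le_i32(&fs::read(&path).unwrap());
    assert_eq!(decoded, golden);
    println!("golden file matches");
    fs::remove_file(&path).unwrap();
}
```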

Thanks,
Micah

On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <li...@gmail.com> wrote:

> Thanks wes. Sure I'll fix it.
>
> > On Fri, Oct 11, 2019 at 6:10 AM Wes McKinney <we...@gmail.com> wrote:
>
> > I just merged the PR https://github.com/apache/arrow-testing/pull/11
> >
> > Various aspects of this make me uncomfortable so I hope they can be
> > addressed in follow up work
> >
> > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <li...@gmail.com>
> > wrote:
> > >
> > > I've create ticket to track here:
> > > https://issues.apache.org/jira/browse/ARROW-6845
> > >
> > > For this moment, can we check in those pregenerated data to unblock
> rust
> > > version's arrow reader?
> > >
> > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <li...@gmail.com>
> > wrote:
> > >
> > > > It would be fine in that case.
> > > >
> > > > On Thu, Oct 10, 2019 at 12:58 PM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <liurenjie2008@gmail.com
> >
> > > >> wrote:
> > > >> >
> > > >> > 1. There already exists a low level parquet writer which can
> produce
> > > >> > parquet file, so unit test should be fine. But writer from arrow
> to
> > > >> parquet
> > > >> > doesn't exist yet, and it may take some period of time to finish
> it.
> > > >> > 2. In fact my data are randomly generated and it's definitely
> > > >> reproducible.
> > > >> > However, I don't think it would be good idea to randomly generate
> > data
> > > >> > everytime we run ci because it would be difficult to debug. For
> > example
> > > >> PR
> > > >> > a introduced a bug, which is triggerred in other PR's build it
> > would be
> > > >> > confusing for contributors.
> > > >>
> > > >> Presumably any random data generation would use a fixed seed
> precisely
> > > >> to be reproducible.
> > > >>
> > > >> > 3. I think it would be good idea to spend effort on integration
> test
> > > >> with
> > > >> > parquet because it's an important use case of arrow. Also similar
> > > >> approach
> > > >> > could be extended to other language and other file format(avro,
> > orc).
> > > >> >
> > > >> >
> > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <wesmckinn@gmail.com
> >
> > > >> wrote:
> > > >> >
> > > >> > > There are a number of issues worth discussion.
> > > >> > >
> > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> > _writer_?
> > > >> > > It's OK to be reliant on other libraries in the short term to
> > produce
> > > >> > > files to test against, but does not strike me as a sustainable
> > > >> > > long-term plan. Fixing bugs can be a lot more difficult than it
> > needs
> > > >> > > to be if you can't write targeted "endogenous" unit tests
> > > >> > >
> > > >> > > 2. Reproducible data generation
> > > >> > >
> > > >> > > I think if you're going to test against a pre-generated corpus,
> > you
> > > >> > > should make sure that generating the corpus is reproducible for
> > other
> > > >> > > developers (i.e. with a Dockerfile), and can be extended by
> > adding new
> > > >> > > files or random data generation.
> > > >> > >
> > > >> > > I additionally would prefer generating the test corpus at test
> > time
> > > >> > > rather than checking in binary files. If this isn't viable right
> > now
> > > >> > > we can create an "arrow-rust-crutch" git repository for you to
> > stash
> > > >> > > binary files until some of these testing scalability issues are
> > > >> > > addressed.
> > > >> > >
> > > >> > > If we're going to spend energy on Parquet integration testing
> with
> > > >> > > Java, this would be a good opportunity to do the work in a way
> > where
> > > >> > > the C++ Parquet library can also participate (since we ought to
> be
> > > >> > > doing integration tests with Java, and we can also read JSON
> > files to
> > > >> > > Arrow).
> > > >> > >
> > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <
> > liurenjie2008@gmail.com>
> > > >> > > wrote:
> > > >> > > >
> > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <
> > andygrove73@gmail.com>
> > > >> > > wrote:
> > > >> > > >
> > > >> > > > > I'm very interested in helping to find a solution to this
> > because
> > > >> we
> > > >> > > really
> > > >> > > > > do need integration tests for Rust to make sure we're
> > compatible
> > > >> with
> > > >> > > other
> > > >> > > > > implementations... there is also the ongoing CI
> dockerization
> > work
> > > >> > > that I
> > > >> > > > > feel is related.
> > > >> > > > >
> > > >> > > > > I haven't looked at the current integration tests yet and
> > would
> > > >> > > appreciate
> > > >> > > > > some pointers on how all of this works (do we have docs?) or
> > > >> where to
> > > >> > > start
> > > >> > > > > looking.
> > > >> > > > >
> > > >> > > > I have a test in my latest PR:
> > > >> https://github.com/apache/arrow/pull/5523
> > > >> > > > And here is the generated data:
> > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > >> > > > As with program to generate these data, it's just a simple
> java
> > > >> program.
> > > >> > > > I'm not sure whether we need to integrate it into arrow.
> > > >> > > >
> > > >> > > > >
> > > >> > > > > I imagine the integration test could follow the approach
> that
> > > >> Renjie is
> > > >> > > > > outlining where we call Java to generate some files and then
> > call
> > > >> Rust
> > > >> > > to
> > > >> > > > > parse them?
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > >
> > > >> > > > > Andy.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
> > > >> liurenjie2008@gmail.com>
> > > >> > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi:
> > > >> > > > > >
> > > >> > > > > > I'm developing rust version of reader which reads parquet
> > into
> > > >> arrow
> > > >> > > > > array.
> > > >> > > > > > To verify the correct of this reader, I use the following
> > > >> approach:
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >    1. Define schema with protobuf.
> > > >> > > > > >    2. Generate json data of this schema using other
> language
> > > >> with
> > > >> > > more
> > > >> > > > > >    sophisticated implementation (e.g. java)
> > > >> > > > > >    3. Generate parquet data of this schema using other
> > language
> > > >> with
> > > >> > > more
> > > >> > > > > >    sophisticated implementation (e.g. java)
> > > >> > > > > >    4. Write tests to read json file, and parquet file into
> > > >> memory
> > > >> > > (arrow
> > > >> > > > > >    array), then compare json data with arrow data.
> > > >> > > > > >
> > > >> > > > > >  I think with this method we can guarantee the correctness
> > of
> > > >> arrow
> > > >> > > > > reader
> > > >> > > > > > because json format is ubiquitous and their implementation
> > are
> > > >> more
> > > >> > > > > stable.
> > > >> > > > > >
> > > >> > > > > > Any comment is appreciated.
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Renjie Liu
> > > >> > > > Software Engineer, MVAD
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Renjie Liu
> > > >> > Software Engineer, MVAD
> > > >>
> > > >
> > >
> > > --
> > > Renjie Liu
> > > Software Engineer, MVAD
> >
>

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Renjie Liu <li...@gmail.com>.
Thanks, Wes. Sure, I'll fix it.

On Fri, Oct 11, 2019 at 6:10 AM Wes McKinney <we...@gmail.com> wrote:

> I just merged the PR https://github.com/apache/arrow-testing/pull/11
>
> Various aspects of this make me uncomfortable so I hope they can be
> addressed in follow up work
>
> On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <li...@gmail.com>
> wrote:
> >
> > I've create ticket to track here:
> > https://issues.apache.org/jira/browse/ARROW-6845
> >
> > For this moment, can we check in those pregenerated data to unblock rust
> > version's arrow reader?
> >
> > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <li...@gmail.com>
> wrote:
> >
> > > It would be fine in that case.
> > >
> > > On Thu, Oct 10, 2019 at 12:58 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <li...@gmail.com>
> > >> wrote:
> > >> >
> > >> > 1. There already exists a low level parquet writer which can produce
> > >> > parquet file, so unit test should be fine. But writer from arrow to
> > >> parquet
> > >> > doesn't exist yet, and it may take some period of time to finish it.
> > >> > 2. In fact my data are randomly generated and it's definitely
> > >> reproducible.
> > >> > However, I don't think it would be good idea to randomly generate
> data
> > >> > everytime we run ci because it would be difficult to debug. For
> example
> > >> PR
> > >> > a introduced a bug, which is triggerred in other PR's build it
> would be
> > >> > confusing for contributors.
> > >>
> > >> Presumably any random data generation would use a fixed seed precisely
> > >> to be reproducible.
> > >>
> > >> > 3. I think it would be good idea to spend effort on integration test
> > >> with
> > >> > parquet because it's an important use case of arrow. Also similar
> > >> approach
> > >> > could be extended to other language and other file format(avro,
> orc).
> > >> >
> > >> >
> > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <we...@gmail.com>
> > >> wrote:
> > >> >
> > >> > > There are a number of issues worth discussion.
> > >> > >
> > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> _writer_?
> > >> > > It's OK to be reliant on other libraries in the short term to
> produce
> > >> > > files to test against, but does not strike me as a sustainable
> > >> > > long-term plan. Fixing bugs can be a lot more difficult than it
> needs
> > >> > > to be if you can't write targeted "endogenous" unit tests
> > >> > >
> > >> > > 2. Reproducible data generation
> > >> > >
> > >> > > I think if you're going to test against a pre-generated corpus,
> you
> > >> > > should make sure that generating the corpus is reproducible for
> other
> > >> > > developers (i.e. with a Dockerfile), and can be extended by
> adding new
> > >> > > files or random data generation.
> > >> > >
> > >> > > I additionally would prefer generating the test corpus at test
> time
> > >> > > rather than checking in binary files. If this isn't viable right
> now
> > >> > > we can create an "arrow-rust-crutch" git repository for you to
> stash
> > >> > > binary files until some of these testing scalability issues are
> > >> > > addressed.
> > >> > >
> > >> > > If we're going to spend energy on Parquet integration testing with
> > >> > > Java, this would be a good opportunity to do the work in a way
> where
> > >> > > the C++ Parquet library can also participate (since we ought to be
> > >> > > doing integration tests with Java, and we can also read JSON
> files to
> > >> > > Arrow).
> > >> > >
> > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <
> liurenjie2008@gmail.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <
> andygrove73@gmail.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > > I'm very interested in helping to find a solution to this
> because
> > >> we
> > >> > > really
> > >> > > > > do need integration tests for Rust to make sure we're
> compatible
> > >> with
> > >> > > other
> > >> > > > > implementations... there is also the ongoing CI dockerization
> work
> > >> > > that I
> > >> > > > > feel is related.
> > >> > > > >
> > >> > > > > I haven't looked at the current integration tests yet and
> would
> > >> > > appreciate
> > >> > > > > some pointers on how all of this works (do we have docs?) or
> > >> where to
> > >> > > start
> > >> > > > > looking.
> > >> > > > >
> > >> > > > I have a test in my latest PR:
> > >> https://github.com/apache/arrow/pull/5523
> > >> > > > And here is the generated data:
> > >> > > > https://github.com/apache/arrow-testing/pull/11
> > >> > > > As with program to generate these data, it's just a simple java
> > >> program.
> > >> > > > I'm not sure whether we need to integrate it into arrow.
> > >> > > >
> > >> > > > >
> > >> > > > > I imagine the integration test could follow the approach that
> > >> Renjie is
> > >> > > > > outlining where we call Java to generate some files and then
> call
> > >> Rust
> > >> > > to
> > >> > > > > parse them?
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > >
> > >> > > > > Andy.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
> > >> liurenjie2008@gmail.com>
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > Hi:
> > >> > > > > >
> > >> > > > > > I'm developing rust version of reader which reads parquet
> into
> > >> arrow
> > >> > > > > array.
> > >> > > > > > To verify the correct of this reader, I use the following
> > >> approach:
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >    1. Define schema with protobuf.
> > >> > > > > >    2. Generate json data of this schema using other language
> > >> with
> > >> > > more
> > >> > > > > >    sophisticated implementation (e.g. java)
> > >> > > > > >    3. Generate parquet data of this schema using other
> language
> > >> with
> > >> > > more
> > >> > > > > >    sophisticated implementation (e.g. java)
> > >> > > > > >    4. Write tests to read json file, and parquet file into
> > >> memory
> > >> > > (arrow
> > >> > > > > >    array), then compare json data with arrow data.
> > >> > > > > >
> > >> > > > > >  I think with this method we can guarantee the correctness
> of
> > >> arrow
> > >> > > > > reader
> > >> > > > > > because json format is ubiquitous and their implementation
> are
> > >> more
> > >> > > > > stable.
> > >> > > > > >
> > >> > > > > > Any comment is appreciated.
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Renjie Liu
> > >> > > > Software Engineer, MVAD
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Renjie Liu
> > >> > Software Engineer, MVAD
> > >>
> > >
> >
> > --
> > Renjie Liu
> > Software Engineer, MVAD
>

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Wes McKinney <we...@gmail.com>.
I just merged the PR https://github.com/apache/arrow-testing/pull/11

Various aspects of this make me uncomfortable, so I hope they can be
addressed in follow-up work.

On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <li...@gmail.com> wrote:
>
> I've create ticket to track here:
> https://issues.apache.org/jira/browse/ARROW-6845
>
> For this moment, can we check in those pregenerated data to unblock rust
> version's arrow reader?
>
> On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <li...@gmail.com> wrote:
>
> > It would be fine in that case.
> >
> > On Thu, Oct 10, 2019 at 12:58 PM Wes McKinney <we...@gmail.com> wrote:
> >
> >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <li...@gmail.com>
> >> wrote:
> >> >
> >> > 1. There already exists a low level parquet writer which can produce
> >> > parquet file, so unit test should be fine. But writer from arrow to
> >> parquet
> >> > doesn't exist yet, and it may take some period of time to finish it.
> >> > 2. In fact my data are randomly generated and it's definitely
> >> reproducible.
> >> > However, I don't think it would be good idea to randomly generate data
> >> > everytime we run ci because it would be difficult to debug. For example
> >> PR
> >> > a introduced a bug, which is triggerred in other PR's build it would be
> >> > confusing for contributors.
> >>
> >> Presumably any random data generation would use a fixed seed precisely
> >> to be reproducible.
> >>
> >> > 3. I think it would be good idea to spend effort on integration test
> >> with
> >> > parquet because it's an important use case of arrow. Also similar
> >> approach
> >> > could be extended to other language and other file format(avro, orc).
> >> >
> >> >
> >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >
> >> > > There are a number of issues worth discussion.
> >> > >
> >> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> >> > > It's OK to be reliant on other libraries in the short term to produce
> >> > > files to test against, but does not strike me as a sustainable
> >> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
> >> > > to be if you can't write targeted "endogenous" unit tests
> >> > >
> >> > > 2. Reproducible data generation
> >> > >
> >> > > I think if you're going to test against a pre-generated corpus, you
> >> > > should make sure that generating the corpus is reproducible for other
> >> > > developers (i.e. with a Dockerfile), and can be extended by adding new
> >> > > files or random data generation.
> >> > >
> >> > > I additionally would prefer generating the test corpus at test time
> >> > > rather than checking in binary files. If this isn't viable right now
> >> > > we can create an "arrow-rust-crutch" git repository for you to stash
> >> > > binary files until some of these testing scalability issues are
> >> > > addressed.
> >> > >
> >> > > If we're going to spend energy on Parquet integration testing with
> >> > > Java, this would be a good opportunity to do the work in a way where
> >> > > the C++ Parquet library can also participate (since we ought to be
> >> > > doing integration tests with Java, and we can also read JSON files to
> >> > > Arrow).
> >> > >
> >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <li...@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <an...@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > I'm very interested in helping to find a solution to this because
> >> we
> >> > > really
> >> > > > > do need integration tests for Rust to make sure we're compatible
> >> with
> >> > > other
> >> > > > > implementations... there is also the ongoing CI dockerization work
> >> > > that I
> >> > > > > feel is related.
> >> > > > >
> >> > > > > I haven't looked at the current integration tests yet and would
> >> > > appreciate
> >> > > > > some pointers on how all of this works (do we have docs?) or
> >> where to
> >> > > start
> >> > > > > looking.
> >> > > > >
> >> > > > I have a test in my latest PR:
> >> https://github.com/apache/arrow/pull/5523
> >> > > > And here is the generated data:
> >> > > > https://github.com/apache/arrow-testing/pull/11
> >> > > > As with program to generate these data, it's just a simple java
> >> program.
> >> > > > I'm not sure whether we need to integrate it into arrow.
> >> > > >
> >> > > > >
> >> > > > > I imagine the integration test could follow the approach that
> >> Renjie is
> >> > > > > outlining where we call Java to generate some files and then call
> >> Rust
> >> > > to
> >> > > > > parse them?
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Andy.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
> >> liurenjie2008@gmail.com>
> >> > > wrote:
> >> > > > >
> >> > > > > > Hi:
> >> > > > > >
> >> > > > > > I'm developing rust version of reader which reads parquet into
> >> arrow
> >> > > > > array.
> >> > > > > > To verify the correct of this reader, I use the following
> >> approach:
> >> > > > > >
> >> > > > > >
> >> > > > > >    1. Define schema with protobuf.
> >> > > > > >    2. Generate json data of this schema using other language
> >> with
> >> > > more
> >> > > > > >    sophisticated implementation (e.g. java)
> >> > > > > >    3. Generate parquet data of this schema using other language
> >> with
> >> > > more
> >> > > > > >    sophisticated implementation (e.g. java)
> >> > > > > >    4. Write tests to read json file, and parquet file into
> >> memory
> >> > > (arrow
> >> > > > > >    array), then compare json data with arrow data.
> >> > > > > >
> >> > > > > >  I think with this method we can guarantee the correctness of
> >> arrow
> >> > > > > reader
> >> > > > > > because json format is ubiquitous and their implementation are
> >> more
> >> > > > > stable.
> >> > > > > >
> >> > > > > > Any comment is appreciated.
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Renjie Liu
> >> > > > Software Engineer, MVAD
> >> > >
> >> >
> >> >
> >> > --
> >> > Renjie Liu
> >> > Software Engineer, MVAD
> >>
> >
>
> --
> Renjie Liu
> Software Engineer, MVAD

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Renjie Liu <li...@gmail.com>.
I've created a ticket to track this:
https://issues.apache.org/jira/browse/ARROW-6845

For the moment, can we check in the pregenerated data to unblock the Rust
version's Arrow reader?

On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <li...@gmail.com> wrote:

> It would be fine in that case.
>
> On Thu, Oct 10, 2019 at 12:58 PM Wes McKinney <we...@gmail.com> wrote:
>
>> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <li...@gmail.com>
>> wrote:
>> >
>> > 1. There already exists a low level parquet writer which can produce
>> > parquet file, so unit test should be fine. But writer from arrow to
>> parquet
>> > doesn't exist yet, and it may take some period of time to finish it.
>> > 2. In fact my data are randomly generated and it's definitely
>> reproducible.
>> > However, I don't think it would be good idea to randomly generate data
>> > everytime we run ci because it would be difficult to debug. For example
>> PR
>> > a introduced a bug, which is triggerred in other PR's build it would be
>> > confusing for contributors.
>>
>> Presumably any random data generation would use a fixed seed precisely
>> to be reproducible.
>>
>> > 3. I think it would be good idea to spend effort on integration test
>> with
>> > parquet because it's an important use case of arrow. Also similar
>> approach
>> > could be extended to other language and other file format(avro, orc).
>> >
>> >
>> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> > > There are a number of issues worth discussion.
>> > >
>> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
>> > > It's OK to be reliant on other libraries in the short term to produce
>> > > files to test against, but does not strike me as a sustainable
>> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
>> > > to be if you can't write targeted "endogenous" unit tests
>> > >
>> > > 2. Reproducible data generation
>> > >
>> > > I think if you're going to test against a pre-generated corpus, you
>> > > should make sure that generating the corpus is reproducible for other
>> > > developers (i.e. with a Dockerfile), and can be extended by adding new
>> > > files or random data generation.
>> > >
>> > > I additionally would prefer generating the test corpus at test time
>> > > rather than checking in binary files. If this isn't viable right now
>> > > we can create an "arrow-rust-crutch" git repository for you to stash
>> > > binary files until some of these testing scalability issues are
>> > > addressed.
>> > >
>> > > If we're going to spend energy on Parquet integration testing with
>> > > Java, this would be a good opportunity to do the work in a way where
>> > > the C++ Parquet library can also participate (since we ought to be
>> > > doing integration tests with Java, and we can also read JSON files to
>> > > Arrow).
>> > >
>> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <li...@gmail.com>
>> > > wrote:
>> > > >
>> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <an...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > I'm very interested in helping to find a solution to this because
>> we
>> > > really
>> > > > > do need integration tests for Rust to make sure we're compatible
>> with
>> > > other
>> > > > > implementations... there is also the ongoing CI dockerization work
>> > > that I
>> > > > > feel is related.
>> > > > >
>> > > > > I haven't looked at the current integration tests yet and would
>> > > appreciate
>> > > > > some pointers on how all of this works (do we have docs?) or
>> where to
>> > > start
>> > > > > looking.
>> > > > >
>> > > > I have a test in my latest PR:
>> https://github.com/apache/arrow/pull/5523
>> > > > And here is the generated data:
>> > > > https://github.com/apache/arrow-testing/pull/11
>> > > > As with program to generate these data, it's just a simple java
>> program.
>> > > > I'm not sure whether we need to integrate it into arrow.
>> > > >
>> > > > >
>> > > > > I imagine the integration test could follow the approach that
>> Renjie is
>> > > > > outlining where we call Java to generate some files and then call
>> Rust
>> > > to
>> > > > > parse them?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Andy.
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
>> liurenjie2008@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Hi:
>> > > > > >
>> > > > > > I'm developing rust version of reader which reads parquet into
>> arrow
>> > > > > array.
>> > > > > > To verify the correct of this reader, I use the following
>> approach:
>> > > > > >
>> > > > > >
>> > > > > >    1. Define schema with protobuf.
>> > > > > >    2. Generate json data of this schema using other language
>> with
>> > > more
>> > > > > >    sophisticated implementation (e.g. java)
>> > > > > >    3. Generate parquet data of this schema using other language
>> with
>> > > more
>> > > > > >    sophisticated implementation (e.g. java)
>> > > > > >    4. Write tests to read json file, and parquet file into
>> memory
>> > > (arrow
>> > > > > >    array), then compare json data with arrow data.
>> > > > > >
>> > > > > >  I think with this method we can guarantee the correctness of
>> arrow
>> > > > > reader
>> > > > > > because json format is ubiquitous and their implementation are
>> more
>> > > > > stable.
>> > > > > >
>> > > > > > Any comment is appreciated.
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > Renjie Liu
>> > > > Software Engineer, MVAD
>> > >
>> >
>> >
>> > --
>> > Renjie Liu
>> > Software Engineer, MVAD
>>
>

-- 
Renjie Liu
Software Engineer, MVAD

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Renjie Liu <li...@gmail.com>.
It would be fine in that case.

On Thu, Oct 10, 2019 at 12:58 PM Wes McKinney <we...@gmail.com> wrote:

> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <li...@gmail.com>
> wrote:
> >
> > 1. There already exists a low level parquet writer which can produce
> > parquet file, so unit test should be fine. But writer from arrow to
> parquet
> > doesn't exist yet, and it may take some period of time to finish it.
> > 2. In fact my data are randomly generated and it's definitely
> reproducible.
> > However, I don't think it would be good idea to randomly generate data
> > everytime we run ci because it would be difficult to debug. For example
> PR
> > a introduced a bug, which is triggerred in other PR's build it would be
> > confusing for contributors.
>
> Presumably any random data generation would use a fixed seed precisely
> to be reproducible.
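
The fixed-seed point quoted above can be sketched in plain Rust. A tiny
linear congruential generator stands in for a real seeded PRNG (an actual
corpus generator would more likely use the rand crate's seedable generators);
the LCG constants and the i64 column type are illustrative assumptions.

```rust
// Minimal linear congruential generator; the constants are Knuth's MMIX
// values, used here only to keep the sketch dependency-free.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

// A "random" test column that is fully determined by the seed, so every CI
// run and every developer machine sees exactly the same data.
fn generate_column(seed: u64, len: usize) -> Vec<i64> {
    let mut rng = Lcg(seed);
    (0..len).map(|_| rng.next_u64() as i64).collect()
}

fn main() {
    // A failure found in one PR's build reproduces exactly in another,
    // which addresses the debuggability concern about random data.
    assert_eq!(generate_column(42, 1_000), generate_column(42, 1_000));
    println!("seed 42 is reproducible");
}
```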
>
> > 3. I think it would be good idea to spend effort on integration test with
> > parquet because it's an important use case of arrow. Also similar
> approach
> > could be extended to other language and other file format(avro, orc).
> >
> >
> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > There are a number of issues worth discussion.
> > >
> > > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> > > It's OK to be reliant on other libraries in the short term to produce
> > > files to test against, but does not strike me as a sustainable
> > > long-term plan. Fixing bugs can be a lot more difficult than it needs
> > > to be if you can't write targeted "endogenous" unit tests
> > >
> > > 2. Reproducible data generation
> > >
> > > I think if you're going to test against a pre-generated corpus, you
> > > should make sure that generating the corpus is reproducible for other
> > > developers (i.e. with a Dockerfile), and can be extended by adding new
> > > files or random data generation.
> > >
> > > I additionally would prefer generating the test corpus at test time
> > > rather than checking in binary files. If this isn't viable right now
> > > we can create an "arrow-rust-crutch" git repository for you to stash
> > > binary files until some of these testing scalability issues are
> > > addressed.
> > >
> > > If we're going to spend energy on Parquet integration testing with
> > > Java, this would be a good opportunity to do the work in a way where
> > > the C++ Parquet library can also participate (since we ought to be
> > > doing integration tests with Java, and we can also read JSON files to
> > > Arrow).
> > >
> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <li...@gmail.com>
> > > wrote:
> > > >
> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <an...@gmail.com>
> > > wrote:
> > > >
> > > > > I'm very interested in helping to find a solution to this because
> we
> > > really
> > > > > do need integration tests for Rust to make sure we're compatible
> with
> > > other
> > > > > implementations... there is also the ongoing CI dockerization work
> > > that I
> > > > > feel is related.
> > > > >
> > > > > I haven't looked at the current integration tests yet and would
> > > appreciate
> > > > > some pointers on how all of this works (do we have docs?) or where
> to
> > > start
> > > > > looking.
> > > > >
> > > > I have a test in my latest PR:
> https://github.com/apache/arrow/pull/5523
> > > > And here is the generated data:
> > > > https://github.com/apache/arrow-testing/pull/11
> > > > As with program to generate these data, it's just a simple java
> program.
> > > > I'm not sure whether we need to integrate it into arrow.
> > > >
> > > > >
> > > > > I imagine the integration test could follow the approach that
> Renjie is
> > > > > outlining where we call Java to generate some files and then call
> Rust
> > > to
> > > > > parse them?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Andy.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <liurenjie2008@gmail.com
> >
> > > wrote:
> > > > >
> > > > > > Hi:
> > > > > >
> > > > > > I'm developing rust version of reader which reads parquet into
> arrow
> > > > > array.
> > > > > > To verify the correct of this reader, I use the following
> approach:
> > > > > >
> > > > > >
> > > > > >    1. Define schema with protobuf.
> > > > > >    2. Generate json data of this schema using other language with
> > > more
> > > > > >    sophisticated implementation (e.g. java)
> > > > > >    3. Generate parquet data of this schema using other language
> with
> > > more
> > > > > >    sophisticated implementation (e.g. java)
> > > > > >    4. Write tests to read json file, and parquet file into memory
> > > (arrow
> > > > > >    array), then compare json data with arrow data.
> > > > > >
> > > > > >  I think with this method we can guarantee the correctness of
> arrow
> > > > > reader
> > > > > > because json format is ubiquitous and their implementation are
> more
> > > > > stable.
> > > > > >
> > > > > > Any comment is appreciated.
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Renjie Liu
> > > > Software Engineer, MVAD
> > >
> >
> >
> > --
> > Renjie Liu
> > Software Engineer, MVAD
>

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <li...@gmail.com> wrote:
>
> 1. There already exists a low-level Parquet writer which can produce
> Parquet files, so unit tests should be fine. But a writer from Arrow to
> Parquet doesn't exist yet, and it may take some time to finish.
> 2. In fact my data are randomly generated and the generation is definitely
> reproducible. However, I don't think it would be a good idea to randomly
> generate data every time we run CI, because failures would be difficult
> to debug. For example, if PR A introduced a bug that is triggered in
> another PR's build, it would be confusing for contributors.

Presumably any random data generation would use a fixed seed precisely
to be reproducible.
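A minimal sketch of that idea, using a hand-rolled SplitMix64 generator so the snippet needs no external crates (the real corpus generator would of course emit Arrow arrays rather than plain vectors, and the type names here are illustrative only):

```rust
// Deterministic data generation from a fixed seed: the same seed always
// yields the same sequence, which is what makes a randomly generated test
// corpus reproducible across machines and CI runs.

struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    // One SplitMix64 step; fully determined by the current state.
    fn next(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Generate `n` nullable i64 values from a seed; `None` models a null slot.
fn generate_column(seed: u64, n: usize) -> Vec<Option<i64>> {
    let mut rng = SplitMix64::new(seed);
    (0..n)
        .map(|_| {
            // Roughly one in eight values is null.
            if rng.next() % 8 == 0 {
                None
            } else {
                Some(rng.next() as i64)
            }
        })
        .collect()
}

fn main() {
    // Same seed, same data: regenerating the corpus is reproducible.
    assert_eq!(generate_column(42, 1000), generate_column(42, 1000));
    // A different seed produces different data (overwhelmingly likely).
    assert_ne!(generate_column(42, 1000), generate_column(43, 1000));
    println!("seeded generation is reproducible");
}
```

Recording only the seed (plus the generator program) would then be enough for any developer to regenerate byte-identical test data.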

> 3. I think it would be good idea to spend effort on integration test with
> parquet because it's an important use case of arrow. Also similar approach
> could be extended to other language and other file format(avro, orc).
>
>
> On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <we...@gmail.com> wrote:
>
> > There are a number of issues worth discussion.
> >
> > 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> > It's OK to be reliant on other libraries in the short term to produce
> > files to test against, but does not strike me as a sustainable
> > long-term plan. Fixing bugs can be a lot more difficult than it needs
> > to be if you can't write targeted "endogenous" unit tests
> >
> > 2. Reproducible data generation
> >
> > I think if you're going to test against a pre-generated corpus, you
> > should make sure that generating the corpus is reproducible for other
> > developers (i.e. with a Dockerfile), and can be extended by adding new
> > files or random data generation.
> >
> > I additionally would prefer generating the test corpus at test time
> > rather than checking in binary files. If this isn't viable right now
> > we can create an "arrow-rust-crutch" git repository for you to stash
> > binary files until some of these testing scalability issues are
> > addressed.
> >
> > If we're going to spend energy on Parquet integration testing with
> > Java, this would be a good opportunity to do the work in a way where
> > the C++ Parquet library can also participate (since we ought to be
> > doing integration tests with Java, and we can also read JSON files to
> > Arrow).
> >
> > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <li...@gmail.com>
> > wrote:
> > >
> > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <an...@gmail.com>
> > wrote:
> > >
> > > > I'm very interested in helping to find a solution to this because we
> > really
> > > > do need integration tests for Rust to make sure we're compatible with
> > other
> > > > implementations... there is also the ongoing CI dockerization work
> > that I
> > > > feel is related.
> > > >
> > > > I haven't looked at the current integration tests yet and would
> > appreciate
> > > > some pointers on how all of this works (do we have docs?) or where to
> > start
> > > > looking.
> > > >
> > > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > > And here is the generated data:
> > > https://github.com/apache/arrow-testing/pull/11
> > > As with program to generate these data, it's just a simple java program.
> > > I'm not sure whether we need to integrate it into arrow.
> > >
> > > >
> > > > I imagine the integration test could follow the approach that Renjie is
> > > > outlining where we call Java to generate some files and then call Rust
> > to
> > > > parse them?
> > > >
> > > > Thanks,
> > > >
> > > > Andy.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <li...@gmail.com>
> > wrote:
> > > >
> > > > > Hi:
> > > > >
> > > > > I'm developing rust version of reader which reads parquet into arrow
> > > > array.
> > > > > To verify the correct of this reader, I use the following approach:
> > > > >
> > > > >
> > > > >    1. Define schema with protobuf.
> > > > >    2. Generate json data of this schema using other language with
> > more
> > > > >    sophisticated implementation (e.g. java)
> > > > >    3. Generate parquet data of this schema using other language with
> > more
> > > > >    sophisticated implementation (e.g. java)
> > > > >    4. Write tests to read json file, and parquet file into memory
> > (arrow
> > > > >    array), then compare json data with arrow data.
> > > > >
> > > > >  I think with this method we can guarantee the correctness of arrow
> > > > reader
> > > > > because json format is ubiquitous and their implementation are more
> > > > stable.
> > > > >
> > > > > Any comment is appreciated.
> > > > >
> > > >
> > >
> > >
> > > --
> > > Renjie Liu
> > > Software Engineer, MVAD
> >
>
>
> --
> Renjie Liu
> Software Engineer, MVAD

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Renjie Liu <li...@gmail.com>.
1. There already exists a low-level Parquet writer which can produce
Parquet files, so unit tests should be fine. But a writer from Arrow to
Parquet doesn't exist yet, and it may take some time to finish.
2. In fact my data are randomly generated and the generation is definitely
reproducible. However, I don't think it would be a good idea to randomly
generate data every time we run CI, because failures would be difficult to
debug. For example, if PR A introduced a bug that is triggered in another
PR's build, it would be confusing for contributors.
3. I think it would be a good idea to spend effort on integration tests
with Parquet because it's an important use case of Arrow. A similar
approach could also be extended to other languages and other file formats
(Avro, ORC).
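One possible middle ground for point 2, sketched below: keep a fixed default seed for normal CI runs, but let a failing run be replayed by overriding the seed through an environment variable, and print whichever seed was used so the failure report contains everything needed to regenerate the same data. The `ARROW_TEST_SEED` variable name is hypothetical, not an existing Arrow convention.

```rust
// Resolve the corpus seed: an explicit override wins, otherwise fall back
// to a fixed default so ordinary CI runs stay deterministic.
fn seed_from(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.parse().ok()).unwrap_or(42)
}

/// Read the seed from the (hypothetical) ARROW_TEST_SEED environment
/// variable, and log it so a failing build can be replayed exactly.
fn corpus_seed() -> u64 {
    let seed = seed_from(std::env::var("ARROW_TEST_SEED").ok().as_deref());
    eprintln!("test corpus seed = {}", seed);
    seed
}

fn main() {
    let seed = corpus_seed();
    // ... generate the test corpus from `seed` here ...
    println!("generating corpus with seed {}", seed);
}
```
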


On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <we...@gmail.com> wrote:

> There are a number of issues worth discussion.
>
> 1. What is the timeline/plan for Rust implementing a Parquet _writer_?
> It's OK to be reliant on other libraries in the short term to produce
> files to test against, but does not strike me as a sustainable
> long-term plan. Fixing bugs can be a lot more difficult than it needs
> to be if you can't write targeted "endogenous" unit tests
>
> 2. Reproducible data generation
>
> I think if you're going to test against a pre-generated corpus, you
> should make sure that generating the corpus is reproducible for other
> developers (i.e. with a Dockerfile), and can be extended by adding new
> files or random data generation.
>
> I additionally would prefer generating the test corpus at test time
> rather than checking in binary files. If this isn't viable right now
> we can create an "arrow-rust-crutch" git repository for you to stash
> binary files until some of these testing scalability issues are
> addressed.
>
> If we're going to spend energy on Parquet integration testing with
> Java, this would be a good opportunity to do the work in a way where
> the C++ Parquet library can also participate (since we ought to be
> doing integration tests with Java, and we can also read JSON files to
> Arrow).
>
> On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <li...@gmail.com>
> wrote:
> >
> > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <an...@gmail.com>
> wrote:
> >
> > > I'm very interested in helping to find a solution to this because we
> really
> > > do need integration tests for Rust to make sure we're compatible with
> other
> > > implementations... there is also the ongoing CI dockerization work
> that I
> > > feel is related.
> > >
> > > I haven't looked at the current integration tests yet and would
> appreciate
> > > some pointers on how all of this works (do we have docs?) or where to
> start
> > > looking.
> > >
> > I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> > And here is the generated data:
> > https://github.com/apache/arrow-testing/pull/11
> > As with program to generate these data, it's just a simple java program.
> > I'm not sure whether we need to integrate it into arrow.
> >
> > >
> > > I imagine the integration test could follow the approach that Renjie is
> > > outlining where we call Java to generate some files and then call Rust
> to
> > > parse them?
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <li...@gmail.com>
> wrote:
> > >
> > > > Hi:
> > > >
> > > > I'm developing rust version of reader which reads parquet into arrow
> > > array.
> > > > To verify the correct of this reader, I use the following approach:
> > > >
> > > >
> > > >    1. Define schema with protobuf.
> > > >    2. Generate json data of this schema using other language with
> more
> > > >    sophisticated implementation (e.g. java)
> > > >    3. Generate parquet data of this schema using other language with
> more
> > > >    sophisticated implementation (e.g. java)
> > > >    4. Write tests to read json file, and parquet file into memory
> (arrow
> > > >    array), then compare json data with arrow data.
> > > >
> > > >  I think with this method we can guarantee the correctness of arrow
> > > reader
> > > > because json format is ubiquitous and their implementation are more
> > > stable.
> > > >
> > > > Any comment is appreciated.
> > > >
> > >
> >
> >
> > --
> > Renjie Liu
> > Software Engineer, MVAD
>


-- 
Renjie Liu
Software Engineer, MVAD

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Wes McKinney <we...@gmail.com>.
There are a number of issues worth discussion.

1. What is the timeline/plan for Rust implementing a Parquet _writer_?
It's OK to be reliant on other libraries in the short term to produce
files to test against, but does not strike me as a sustainable
long-term plan. Fixing bugs can be a lot more difficult than it needs
to be if you can't write targeted "endogenous" unit tests.

2. Reproducible data generation

I think if you're going to test against a pre-generated corpus, you
should make sure that generating the corpus is reproducible for other
developers (i.e. with a Dockerfile), and can be extended by adding new
files or random data generation.

I additionally would prefer generating the test corpus at test time
rather than checking in binary files. If this isn't viable right now
we can create an "arrow-rust-crutch" git repository for you to stash
binary files until some of these testing scalability issues are
addressed.

If we're going to spend energy on Parquet integration testing with
Java, this would be a good opportunity to do the work in a way where
the C++ Parquet library can also participate (since we ought to be
doing integration tests with Java, and we can also read JSON files to
Arrow).

On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <li...@gmail.com> wrote:
>
> On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <an...@gmail.com> wrote:
>
> > I'm very interested in helping to find a solution to this because we really
> > do need integration tests for Rust to make sure we're compatible with other
> > implementations... there is also the ongoing CI dockerization work that I
> > feel is related.
> >
> > I haven't looked at the current integration tests yet and would appreciate
> > some pointers on how all of this works (do we have docs?) or where to start
> > looking.
> >
> I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
> And here is the generated data:
> https://github.com/apache/arrow-testing/pull/11
> As for the program that generates these data, it's just a simple Java program.
> I'm not sure whether we need to integrate it into Arrow.
>
> >
> > I imagine the integration test could follow the approach that Renjie is
> > outlining where we call Java to generate some files and then call Rust to
> > parse them?
> >
> > Thanks,
> >
> > Andy.
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <li...@gmail.com> wrote:
> >
> > > Hi:
> > >
> > > I'm developing rust version of reader which reads parquet into arrow
> > array.
> > > To verify the correct of this reader, I use the following approach:
> > >
> > >
> > >    1. Define schema with protobuf.
> > >    2. Generate json data of this schema using other language with more
> > >    sophisticated implementation (e.g. java)
> > >    3. Generate parquet data of this schema using other language with more
> > >    sophisticated implementation (e.g. java)
> > >    4. Write tests to read json file, and parquet file into memory (arrow
> > >    array), then compare json data with arrow data.
> > >
> > >  I think with this method we can guarantee the correctness of arrow
> > reader
> > > because json format is ubiquitous and their implementation are more
> > stable.
> > >
> > > Any comment is appreciated.
> > >
> >
>
>
> --
> Renjie Liu
> Software Engineer, MVAD

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Renjie Liu <li...@gmail.com>.
On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <an...@gmail.com> wrote:

> I'm very interested in helping to find a solution to this because we really
> do need integration tests for Rust to make sure we're compatible with other
> implementations... there is also the ongoing CI dockerization work that I
> feel is related.
>
> I haven't looked at the current integration tests yet and would appreciate
> some pointers on how all of this works (do we have docs?) or where to start
> looking.
>
I have a test in my latest PR: https://github.com/apache/arrow/pull/5523
And here is the generated data:
https://github.com/apache/arrow-testing/pull/11
As for the program that generates these data, it's just a simple Java program.
I'm not sure whether we need to integrate it into Arrow.

>
> I imagine the integration test could follow the approach that Renjie is
> outlining where we call Java to generate some files and then call Rust to
> parse them?
>
> Thanks,
>
> Andy.
>
>
>
>
>
>
>
> On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <li...@gmail.com> wrote:
>
> > Hi:
> >
> > I'm developing rust version of reader which reads parquet into arrow
> array.
> > To verify the correct of this reader, I use the following approach:
> >
> >
> >    1. Define schema with protobuf.
> >    2. Generate json data of this schema using other language with more
> >    sophisticated implementation (e.g. java)
> >    3. Generate parquet data of this schema using other language with more
> >    sophisticated implementation (e.g. java)
> >    4. Write tests to read json file, and parquet file into memory (arrow
> >    array), then compare json data with arrow data.
> >
> >  I think with this method we can guarantee the correctness of arrow
> reader
> > because json format is ubiquitous and their implementation are more
> stable.
> >
> > Any comment is appreciated.
> >
>


-- 
Renjie Liu
Software Engineer, MVAD

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Posted by Andy Grove <an...@gmail.com>.
I'm very interested in helping to find a solution to this because we really
do need integration tests for Rust to make sure we're compatible with other
implementations... there is also the ongoing CI dockerization work that I
feel is related.

I haven't looked at the current integration tests yet and would appreciate
some pointers on how all of this works (do we have docs?) or where to start
looking.

I imagine the integration test could follow the approach that Renjie is
outlining where we call Java to generate some files and then call Rust to
parse them?

Thanks,

Andy.







On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <li...@gmail.com> wrote:

> Hi:
>
> I'm developing the Rust version of a reader which reads Parquet into Arrow arrays.
> To verify the correctness of this reader, I use the following approach:
>
>
>    1. Define schema with protobuf.
>    2. Generate json data of this schema using other language with more
>    sophisticated implementation (e.g. java)
>    3. Generate parquet data of this schema using other language with more
>    sophisticated implementation (e.g. java)
>    4. Write tests to read json file, and parquet file into memory (arrow
>    array), then compare json data with arrow data.
>
>  I think with this method we can guarantee the correctness of the Arrow reader
> because the JSON format is ubiquitous and its implementations are more stable.
>
> Any comment is appreciated.
>
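Step 4 of the quoted proposal, comparing the Parquet-derived data against the JSON reference, could be sketched as below. Both reader functions are hypothetical stand-ins that return hardcoded data; in a real integration test they would call the Rust Parquet reader and a JSON reader for the same schema. A nullable i64 column is modeled as `Vec<Option<i64>>`.

```rust
type Column = Vec<Option<i64>>;

/// Stand-in for reading one column out of the Parquet file under test.
fn read_parquet_column() -> Column {
    vec![Some(1), None, Some(3)]
}

/// Stand-in for reading the same column from the reference JSON file.
fn read_json_column() -> Column {
    vec![Some(1), None, Some(3)]
}

/// Compare two columns row by row, reporting the first mismatch so a
/// failing integration test points directly at the offending row.
fn compare_columns(actual: &Column, expected: &Column) -> Result<(), String> {
    if actual.len() != expected.len() {
        return Err(format!(
            "row count mismatch: {} vs {}",
            actual.len(),
            expected.len()
        ));
    }
    for (i, (a, e)) in actual.iter().zip(expected.iter()).enumerate() {
        if a != e {
            return Err(format!("row {}: got {:?}, expected {:?}", i, a, e));
        }
    }
    Ok(())
}

fn main() {
    let actual = read_parquet_column();
    let expected = read_json_column();
    compare_columns(&actual, &expected).expect("parquet/json mismatch");
    println!("columns match");
}
```

Reporting the first mismatching row (rather than a bare boolean) keeps failures debuggable, which matters once the corpus is randomly generated.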