You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Chao Sun <su...@apache.org> on 2018/12/27 20:10:08 UTC

[Rust] move parquet into a separate sub-crate

Hi,

It just occurs to me that it may be a better idea to move the parquet
module into a separate sub-crate by using cargo workspaces
<https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html>. The
advantage is that we can make the project more modular (in future, we may
want to add more sub-crates such as arrow/parquet_derive, orc, gandiva,
etc), and allow us to run CI jobs separately on each crate.

Some small caveats:
1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
depends on arrow, we can't reference parquet in arrow. This doesn't seem
like an issue though since arrow itself should be physical on-disk format
independent. I also didn't see any reference on parquet in cpp/src/arrow.
2. The path dependency used in workspace has to be changed to a version
number when we do "cargo publish". This should be added to the release
instructions and committer who performs the job should do the extra step.

Thoughts?

Chao

Re: [Rust] move parquet into a separate sub-crate

Posted by Chao Sun <su...@apache.org>.
Thanks Krisztian, I think you are right at the beginning. I was concerned
about the crate dependency but didn't know this can be solved by cargo
workspace + path dependency.

I've filed ARROW-4137 with a PR.

Chao

On Sun, Dec 30, 2018 at 4:32 AM Krisztián Szűcs <sz...@gmail.com>
wrote:

> Hi!
>
> I support the idea[1]!
>
> Cheers, Krisztian
>
> [1] https://github.com/apache/arrow/pull/3050#discussion_r237412529
>
> On Thu, Dec 27, 2018 at 9:10 PM Chao Sun <su...@apache.org> wrote:
>
> > Hi,
> >
> > It just occurs to me that it may be a better idea to move the parquet
> > module into a separate sub-crate by using cargo workspaces
> > <https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html>. The
> > advantage is that we can make the project more modular (in future, we may
> > want to add more sub-crates such as arrow/parquet_derive, orc, gandiva,
> > etc), and allow us to run CI jobs separately on each crate.
> >
> > Some small caveats:
> > 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> > depends on arrow, we can't reference parquet in arrow. This doesn't seem
> > like an issue though since arrow itself should be physical on-disk format
> > independent. I also didn't see any reference on parquet in cpp/src/arrow.
> > 2. The path dependency used in workspace has to be changed to a version
> > number when we do "cargo publish". This should be added to the release
> > instructions and committer who performs the job should do the extra step.
> >
> > Thoughts?
> >
> > Chao
> >
>

Re: [Rust] move parquet into a separate sub-crate

Posted by Krisztián Szűcs <sz...@gmail.com>.
Hi!

I support the idea[1]!

Cheers, Krisztian

[1] https://github.com/apache/arrow/pull/3050#discussion_r237412529

On Thu, Dec 27, 2018 at 9:10 PM Chao Sun <su...@apache.org> wrote:

> Hi,
>
> It just occurs to me that it may be a better idea to move the parquet
> module into a separate sub-crate by using cargo workspaces
> <https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html>. The
> advantage is that we can make the project more modular (in future, we may
> want to add more sub-crates such as arrow/parquet_derive, orc, gandiva,
> etc), and allow us to run CI jobs separately on each crate.
>
> Some small caveats:
> 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> depends on arrow, we can't reference parquet in arrow. This doesn't seem
> like an issue though since arrow itself should be physical on-disk format
> independent. I also didn't see any reference on parquet in cpp/src/arrow.
> 2. The path dependency used in workspace has to be changed to a version
> number when we do "cargo publish". This should be added to the release
> instructions and committer who performs the job should do the extra step.
>
> Thoughts?
>
> Chao
>

Re: [Rust] move parquet into a separate sub-crate

Posted by paddy horan <pa...@hotmail.com>.
Ok, thanks Chao.

Sounds good to me then.

P

________________________________
From: Chao Sun <su...@apache.org>
Sent: Saturday, December 29, 2018 2:20 AM
To: dev@arrow.apache.org
Subject: Re: [Rust] move parquet into a separate sub-crate

Thanks Paddy. Similarly, I can't see a reason for arrow to reference
parquet since it is all about the in-memory data representation and tools
to build and process the data. In case it does, we probably should move the
logic to the parquet side.

IMO the arrow-parquet integration (e.g., reading from parquet to arrow,
writing arrow into parquet) should happen in parquet side as it involves
encoding/decoding mechanisms which are specific to the latter. Therefore,
parquet needs to depend on arrow. The cargo workspace is pretty flexible
about this so a sub-crate is allowed to depend on the main crate.

On Fri, Dec 28, 2018 at 6:55 PM paddy horan <pa...@hotmail.com> wrote:

> This seems reasonable.  The flexibility with CI is a positive for sure.
>
> > 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> > depends on arrow, we can't reference parquet in arrow.
>
> This is my only concern, the Rust implementation is evolving rapidly and
> adopting workspaces may reduce our flexibility, I can’t think of a specific
> situation right now but might this be a problem in the future?
>
> If we have this restriction we have to agree on which way dependencies
> flow, what is your preference?
>
> I have not used workspaces in anger but it seems it is designed for crates
> to flow up?  i.e. that arrow would be allowed to reference the parquet
> sub-crate.
>
> P
>
> From: Renjie Liu<ma...@gmail.com>
> Sent: Thursday, December 27, 2018 10:09 PM
> To: dev@arrow.apache.org<ma...@arrow.apache.org>
> Subject: Re: [Rust] move parquet into a separate sub-crate
>
> Cool. It may also be worthy to put adapters into a separate crate.
>
> On Fri, Dec 28, 2018 at 4:10 AM Chao Sun <su...@apache.org> wrote:
>
> > Hi,
> >
> > It just occurs to me that it may be a better idea to move the parquet
> > module into a separate sub-crate by using cargo workspaces
> > <https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html>. The
> > advantage is that we can make the project more modular (in future, we may
> > want to add more sub-crates such as arrow/parquet_derive, orc, gandiva,
> > etc), and allow us to run CI jobs separately on each crate.
> >
> > Some small caveats:
> > 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> > depends on arrow, we can't reference parquet in arrow. This doesn't seem
> > like an issue though since arrow itself should be physical on-disk format
> > independent. I also didn't see any reference on parquet in cpp/src/arrow.
> > 2. The path dependency used in workspace has to be changed to a version
> > number when we do "cargo publish". This should be added to the release
> > instructions and committer who performs the job should do the extra step.
> >
> > Thoughts?
> >
> > Chao
> >
>
>
> --
> Renjie Liu
> Software Engineer, MVAD
>
>

Re: [Rust] move parquet into a separate sub-crate

Posted by Chao Sun <su...@apache.org>.
Thanks Paddy. Similarly, I can't see a reason for arrow to reference
parquet since it is all about the in-memory data representation and tools
to build and process the data. In case it does, we probably should move the
logic to the parquet side.

IMO the arrow-parquet integration (e.g., reading from parquet to arrow,
writing arrow into parquet) should happen in parquet side as it involves
encoding/decoding mechanisms which are specific to the latter. Therefore,
parquet needs to depend on arrow. The cargo workspace is pretty flexible
about this so a sub-crate is allowed to depend on the main crate.

On Fri, Dec 28, 2018 at 6:55 PM paddy horan <pa...@hotmail.com> wrote:

> This seems reasonable.  The flexibility with CI is a positive for sure.
>
> > 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> > depends on arrow, we can't reference parquet in arrow.
>
> This is my only concern, the Rust implementation is evolving rapidly and
> adopting workspaces may reduce our flexibility, I can’t think of a specific
> situation right now but might this be a problem in the future?
>
> If we have this restriction we have to agree on which way dependencies
> flow, what is your preference?
>
> I have not used workspaces in anger but it seems it is designed for crates
> to flow up?  i.e. that arrow would be allowed to reference the parquet
> sub-crate.
>
> P
>
> From: Renjie Liu<ma...@gmail.com>
> Sent: Thursday, December 27, 2018 10:09 PM
> To: dev@arrow.apache.org<ma...@arrow.apache.org>
> Subject: Re: [Rust] move parquet into a separate sub-crate
>
> Cool. It may also be worthy to put adapters into a separate crate.
>
> On Fri, Dec 28, 2018 at 4:10 AM Chao Sun <su...@apache.org> wrote:
>
> > Hi,
> >
> > It just occurs to me that it may be a better idea to move the parquet
> > module into a separate sub-crate by using cargo workspaces
> > <https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html>. The
> > advantage is that we can make the project more modular (in future, we may
> > want to add more sub-crates such as arrow/parquet_derive, orc, gandiva,
> > etc), and allow us to run CI jobs separately on each crate.
> >
> > Some small caveats:
> > 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> > depends on arrow, we can't reference parquet in arrow. This doesn't seem
> > like an issue though since arrow itself should be physical on-disk format
> > independent. I also didn't see any reference on parquet in cpp/src/arrow.
> > 2. The path dependency used in workspace has to be changed to a version
> > number when we do "cargo publish". This should be added to the release
> > instructions and committer who performs the job should do the extra step.
> >
> > Thoughts?
> >
> > Chao
> >
>
>
> --
> Renjie Liu
> Software Engineer, MVAD
>
>

RE: [Rust] move parquet into a separate sub-crate

Posted by paddy horan <pa...@hotmail.com>.
This seems reasonable.  The flexibility with CI is a positive for sure.

> 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> depends on arrow, we can't reference parquet in arrow.

This is my only concern, the Rust implementation is evolving rapidly and adopting workspaces may reduce our flexibility, I can’t think of a specific situation right now but might this be a problem in the future?

If we have this restriction we have to agree on which way dependencies flow, what is your preference?

I have not used workspaces in anger but it seems it is designed for crates to flow up?  i.e. that arrow would be allowed to reference the parquet sub-crate.

P

From: Renjie Liu<ma...@gmail.com>
Sent: Thursday, December 27, 2018 10:09 PM
To: dev@arrow.apache.org<ma...@arrow.apache.org>
Subject: Re: [Rust] move parquet into a separate sub-crate

Cool. It may also be worthy to put adapters into a separate crate.

On Fri, Dec 28, 2018 at 4:10 AM Chao Sun <su...@apache.org> wrote:

> Hi,
>
> It just occurs to me that it may be a better idea to move the parquet
> module into a separate sub-crate by using cargo workspaces
> <https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html>. The
> advantage is that we can make the project more modular (in future, we may
> want to add more sub-crates such as arrow/parquet_derive, orc, gandiva,
> etc), and allow us to run CI jobs separately on each crate.
>
> Some small caveats:
> 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> depends on arrow, we can't reference parquet in arrow. This doesn't seem
> like an issue though since arrow itself should be physical on-disk format
> independent. I also didn't see any reference on parquet in cpp/src/arrow.
> 2. The path dependency used in workspace has to be changed to a version
> number when we do "cargo publish". This should be added to the release
> instructions and committer who performs the job should do the extra step.
>
> Thoughts?
>
> Chao
>


--
Renjie Liu
Software Engineer, MVAD


Re: [Rust] move parquet into a separate sub-crate

Posted by Renjie Liu <li...@gmail.com>.
Cool. It may also be worthy to put adapters into a separate crate.

On Fri, Dec 28, 2018 at 4:10 AM Chao Sun <su...@apache.org> wrote:

> Hi,
>
> It just occurs to me that it may be a better idea to move the parquet
> module into a separate sub-crate by using cargo workspaces
> <https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html>. The
> advantage is that we can make the project more modular (in future, we may
> want to add more sub-crates such as arrow/parquet_derive, orc, gandiva,
> etc), and allow us to run CI jobs separately on each crate.
>
> Some small caveats:
> 1. Cargo doesn't allow cyclic dependency. So if the parquet sub-crate
> depends on arrow, we can't reference parquet in arrow. This doesn't seem
> like an issue though since arrow itself should be physical on-disk format
> independent. I also didn't see any reference on parquet in cpp/src/arrow.
> 2. The path dependency used in workspace has to be changed to a version
> number when we do "cargo publish". This should be added to the release
> instructions and committer who performs the job should do the extra step.
>
> Thoughts?
>
> Chao
>


-- 
Renjie Liu
Software Engineer, MVAD