Posted to dev@parquet.apache.org by Atour Mousavi Gourabi <at...@live.com> on 2023/06/08 16:24:37 UTC

Parquet without Hadoop dependencies

Dear all,

The Java implementations of the Parquet readers and writers seem pretty tightly coupled to Hadoop (see PARQUET-1822). For some projects this causes issues, as Hadoop is a big and unnecessary dependency when you might just need to write to disk. Is there any appetite here for separating out the Hadoop code and supporting more convenient ways to write to disk out of the box? I am willing to work on these changes, but would like some pointers on whether such patches would be reviewed and accepted, as PARQUET-1822 has been open for over three years now.

Best regards,
Atour Mousavi Gourabi
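
For context, the coupling described above is visible in the current writer API: even a purely local write goes through org.apache.hadoop.fs.Path and org.apache.hadoop.conf.Configuration. Below is a minimal sketch using the Avro bindings; the schema, file name, and class name are made up for illustration.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class LocalWriteExample {
  public static void main(String[] args) throws Exception {
    Schema schema = SchemaBuilder.record("Example").fields()
        .requiredString("name")
        .requiredInt("value")
        .endRecord();

    GenericRecord record = new GenericData.Record(schema);
    record.put("name", "example");
    record.put("value", 42);

    // Even for a plain local file, the builder takes Hadoop's Path and
    // Configuration types, so the Hadoop jars must be on the classpath.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("file:///tmp/example.parquet"))
        .withSchema(schema)
        .withConf(new Configuration())
        .build()) {
      writer.write(record);
    }
  }
}

Running something like this pulls the full Hadoop client jars onto the classpath even though only the local filesystem is touched, which is the overhead described above.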

Re: Parquet without Hadoop dependencies

Posted by Gang Wu <us...@gmail.com>.
Yes, a PR would be welcome!

Re: Parquet without Hadoop dependencies

Posted by Atour Mousavi Gourabi <at...@live.com>.
Hi Gang,

The breaking changes are a valid concern, so I agree we should consult with downstream communities before releasing any.
Right now we already make limited use of the interfaces you describe (for the filesystem). They let users read and write Parquet without installing Hadoop on their system, in a slightly convoluted way. Those users still need to package the Hadoop dependencies, but I think we should support this path out of the box by providing the implementations they would need. I can have a PR open for this quickly if you agree we should support it.
As for not packaging hadoop-client-runtime, we would first need to include the implementations described above and then introduce some abstraction over at least the Hadoop Configuration. I think this should be feasible to implement in a non-breaking way, though I cannot give you a timeline.

Best regards,
Atour
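
The "slightly convoluted way" mentioned above is, roughly, implementing parquet-mr's org.apache.parquet.io.OutputFile (and InputFile for reads) on top of java.nio and handing that to the writer builders instead of a Hadoop Path. A rough sketch of such an implementation follows; the class name and details are illustrative, not an existing parquet-mr class.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Illustrative local-filesystem OutputFile backed by java.nio.
public class NioOutputFile implements OutputFile {
  private final Path path;

  public NioOutputFile(Path path) {
    this.path = path;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    // Fail if the file already exists, mirroring the "create" contract.
    return wrap(Files.newOutputStream(path,
        StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE));
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return wrap(Files.newOutputStream(path)); // CREATE + TRUNCATE_EXISTING by default
  }

  @Override
  public boolean supportsBlockSize() {
    return false; // local files have no HDFS-style block size
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }

  // Track the write position ourselves; the Parquet footer needs offsets.
  private static PositionOutputStream wrap(OutputStream out) {
    return new PositionOutputStream() {
      private long pos = 0;

      @Override
      public long getPos() {
        return pos;
      }

      @Override
      public void write(int b) throws IOException {
        out.write(b);
        pos += 1;
      }

      @Override
      public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        pos += len;
      }

      @Override
      public void close() throws IOException {
        out.close();
      }
    };
  }
}

An instance of this can be passed to the writer builder overloads that accept an OutputFile (for example in parquet-avro), but the builders still take a Hadoop Configuration, so the Hadoop jars remain on the classpath; shipping ready-made implementations like this out of the box is what the message above proposes.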

Re: Parquet without Hadoop dependencies

Posted by Gang Wu <us...@gmail.com>.
My main concern about breaking changes is the effort it takes for downstream
projects to adopt a new Parquet release. We need to hear more voices
from those communities to reach a consensus on whether breaking changes are
acceptable.

I just took a glance at the Hadoop dependencies; it seems the major ones are
used for configuration, the filesystem, and codecs. Could we introduce a layer
of interfaces for them and make those Hadoop classes concrete
implementations? I think this is the first step to splitting the core features
of Parquet from Hadoop.

Back to the hadoop-client-api proposal: my intention is to support basic
Parquet features with only hadoop-client-api pulled into the dependencies,
and the full feature set with hadoop-client-runtime pulled in as well.
Is that possible?
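
One hypothetical shape for the layer of interfaces suggested above; none of these types exist in parquet-mr, and the names are invented for this sketch. The idea is that core Parquet code would program against a small abstraction, with the Hadoop Configuration as one concrete implementation kept in parquet-hadoop and a plain, Hadoop-free implementation available for everyone else.

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

// Hypothetical abstraction over key/value configuration (not an existing parquet-mr API).
interface ParquetConf {
  String get(String key, String defaultValue);
  void set(String key, String value);
  boolean getBoolean(String key, boolean defaultValue);
}

// Concrete implementation backed by Hadoop, kept inside parquet-hadoop.
class HadoopParquetConf implements ParquetConf {
  private final Configuration conf;

  HadoopParquetConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public String get(String key, String defaultValue) {
    return conf.get(key, defaultValue);
  }

  @Override
  public void set(String key, String value) {
    conf.set(key, value);
  }

  @Override
  public boolean getBoolean(String key, boolean defaultValue) {
    return conf.getBoolean(key, defaultValue);
  }
}

// Hadoop-free implementation that a future Hadoop-independent module could ship.
class MapParquetConf implements ParquetConf {
  private final Map<String, String> values = new HashMap<>();

  @Override
  public String get(String key, String defaultValue) {
    return values.getOrDefault(key, defaultValue);
  }

  @Override
  public void set(String key, String value) {
    values.put(key, value);
  }

  @Override
  public boolean getBoolean(String key, boolean defaultValue) {
    String v = values.get(key);
    return v == null ? defaultValue : Boolean.parseBoolean(v);
  }
}

Similar interface/implementation pairs would be needed for the filesystem (which OutputFile/InputFile already partially cover) and for the compression codecs.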

Re: Parquet without Hadoop dependencies

Posted by Atour Mousavi Gourabi <at...@live.com>.
Hi Gang,

I don't think it's feasible to make a new module for it this way, as a lot of the support for this part of the code (codecs, etc.) resides in parquet-hadoop. The module would likely require a dependency on parquet-hadoop, making it pretty useless. That could be avoided by porting the supporting classes over to the new core module, but this could cause similar issues.
As for replacing the Hadoop dependencies with hadoop-client-api and hadoop-client-runtime, this could indeed be nice for some use cases. It could avoid a big chunk of the Hadoop-related issues, though we would still require users to package parts of Hadoop. There are some convoluted ways this can be achieved now, which we could support out of the box, at least for writing to disk. I would see this as more of a temporary solution, though, as we would still be forcing pretty big dependencies on users who often do not need them.
It seems to me that properly decoupling the reader/writer code from this dependency will likely require breaking changes in the future, as it is hardwired into a large part of the logic. Maybe something to consider for the next major release?

Best regards,
Atour

Re: Parquet without Hadoop dependencies

Posted by Gang Wu <us...@gmail.com>.
That may break many downstream projects. At the very least we cannot break
parquet-hadoop (or any existing module). If you can add a new module
like parquet-core that provides limited reader/writer features without Hadoop
support, and then make parquet-hadoop depend on parquet-core, that
would be acceptable.

One possible workaround is to replace the various Hadoop dependencies
with hadoop-client-api and hadoop-client-runtime in parquet-mr. This
may make it much easier for users to add the Hadoop dependency, but those
artifacts are only available from Hadoop 3.0.0 onwards.

Re: Parquet without Hadoop dependencies

Posted by Atour Mousavi Gourabi <at...@live.com>.
Hi Gang,

Backward compatibility does indeed seem challenging here, especially as I'd rather see the writers/readers moved out of parquet-hadoop after they've been decoupled. What are your thoughts on this?

Best regards,
Atour

Re: Parquet without Hadoop dependencies

Posted by Gang Wu <us...@gmail.com>.
Hi Atour,

Thanks for bringing this up!

From what I observed in PARQUET-1822, I think supporting Parquet
reading/writing without Hadoop installed is a valid use case.
The challenge is backward compatibility. It would be great if you
could work on it.

Best,
Gang
