Posted to hdfs-dev@hadoop.apache.org by Steve Loughran <st...@cloudera.com.INVALID> on 2022/06/08 15:24:04 UTC

[DISCUSS] Filesystem API shim library to assist applications still targeting previous hadoop releases.

I've just created an initial project "fs-api-shim" to provide controlled
access to the hadoop 3.3.3+ filesystem API calls on hadoop 3.2.0+ releases
https://github.com/steveloughran/fs-api-shim

The goal here is to make it possible for core file format libraries
(Parquet, Avro, ORC, Arrow etc) and other apps (HBase, ...) to take
advantage of those APIs which we have updated and optimised for access to
cloud stores. Currently the applications do not, and so underperform
on recent releases. I have the ability to change our internal forks but I
would like to let others gain from the changes and avoid having to diverge
our internal libraries too much.

Currently too many libraries seem frozen in time:

Avro: still rejecting changes which don't compile on hadoop 2
https://github.com/apache/avro/pull/1431

Parquet: still using reflection to access filesystem API calls which are
not in hadoop 1.x
https://github.com/apache/parquet-mr/pull/971

I'm not going to support hadoop 2.10, but we can at least say "move up to
hadoop 3.2.x and we will let you use later APIs when available".

Some calls, like openFile(), will work everywhere. On versions with the
openFile() builder API they will take the file status and seek policy, so
libraries can declare whether their IO is random or sequential and skip the
HEAD requests otherwise made to the object stores to verify that the file
exists and to determine its length for the ranged GET requests which follow.

https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FileSystemShim.java#L38

On Hadoop 3.2.x, or if openFile() fails for some reason, it will just
downgrade to the classic open() call.
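
A minimal sketch of that fallback pattern. OldStore, NewStore, BrokenStore
and openWithFallback() are hypothetical stand-ins so that the example runs
without hadoop on the classpath; it is not the shim's actual code, and the
real openFile() returns a builder rather than taking these arguments.

```java
import java.lang.reflect.Method;

/** Hypothetical stand-in for an older filesystem: only the classic open(). */
class OldStore {
    public String open(String path) {
        return "open:" + path;
    }
}

/** Hypothetical stand-in for a newer filesystem which adds openFile(). */
class NewStore extends OldStore {
    public String openFile(String path, String readPolicy) {
        return "openFile:" + path + ":" + readPolicy;
    }
}

/** Hypothetical newer filesystem whose openFile() fails at runtime. */
class BrokenStore extends OldStore {
    public String openFile(String path, String readPolicy) {
        throw new IllegalStateException("simulated failure");
    }
}

final class OpenFileShimSketch {
    private OpenFileShimSketch() {
    }

    /** Use openFile() if it is present and works; otherwise use open(). */
    static String openWithFallback(OldStore store, String path, String readPolicy) {
        try {
            Method m = store.getClass()
                .getMethod("openFile", String.class, String.class);
            return (String) m.invoke(store, path, readPolicy);
        } catch (ReflectiveOperationException e) {
            // Either the method is absent (older release) or the call
            // itself threw: downgrade to the classic open().
            return store.open(path);
        }
    }

    public static void main(String[] args) {
        System.out.println(openWithFallback(new OldStore(), "data.parquet", "random"));
        System.out.println(openWithFallback(new NewStore(), "data.parquet", "random"));
        System.out.println(openWithFallback(new BrokenStore(), "data.parquet", "random"));
    }
}
```

The caller never needs to know which path was taken, which is the point:
the application compiles against the older API and still gets the newer
call where it exists.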

For other API calls we can support dynamic binding through reflection, but
with no fallback if they are unavailable. This will allow libraries to
use the API calls if present, but force them to come up with alternative
solutions if not.
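
A rough sketch of what "bind if present, no fallback" could look like.
OptionalCall is a hypothetical helper written for this mail, not a class in
the shim library, and java.lang.String stands in for the hadoop class being
probed.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

/** Hypothetical helper: binds to a method by name if the class has it. */
final class OptionalCall {
    private final String name;
    private final Method method; // null when the target class lacks the method

    OptionalCall(Class<?> target, String name, Class<?>... parameterTypes) {
        this.name = name;
        Method found;
        try {
            found = target.getMethod(name, parameterTypes);
        } catch (NoSuchMethodException e) {
            found = null;
        }
        this.method = found;
    }

    /** Lets the caller probe before choosing its own alternative. */
    boolean available() {
        return method != null;
    }

    /** Invoke the bound method; there is deliberately no fallback. */
    Object invoke(Object instance, Object... args) {
        if (method == null) {
            throw new UnsupportedOperationException(name + " is not available");
        }
        try {
            return method.invoke(instance, args);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        } catch (InvocationTargetException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        OptionalCall length = new OptionalCall(String.class, "length");
        OptionalCall missing = new OptionalCall(String.class, "noSuchCall");
        System.out.println(length.available());
        System.out.println(length.invoke("shim"));
        System.out.println(missing.available());
    }
}
```

A library would check available() once, then pick between the new call and
its own existing code path.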

A key part of this is FSDataInputStream, where the ByteBufferReadable API
would be of benefit to Parquet:

https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FSDataInputStreamShim.java

When we get the vectored IO feature branch in, we can offer similar
reflection-based access. It means applications can compile on hadoop 3.2.x
and 3.3.x but still take advantage of the APIs when they are running on a
version which has them.

I'm going to stay clear of more complicated APIs which don't offer tangible
performance gains and which are very hard to do (IOStatistics).

Testing is fun; I have a plan there which consists of FS contract tests in
the shim test source tree to verify the 3.2.0 functionality and an adjacent
module which will run those same tests against more recent versions. The
tests will have to be targetable against object stores as well as local
and mini HDFS filesystems.

This is all in github; however it is very much a hadoop extension library.
Is there a way we could release it as an ASF Library but on a different
timetable from normal Hadoop releases? There is always incubator, but this
is such a minor project it is closer to the org.apache.hadoop.thirdparty
library in that it is something all current committers should be able
to commit to and release, while releasing on a schedule independent of
hadoop releases themselves. Having it come from this project should give it
more legitimacy.

Steve

Re: [DISCUSS] Filesystem API shim library to assist applications still targeting previous hadoop releases.

Posted by Ayush Saxena <ay...@gmail.com>.
Just answering the last point:
>
> Is there a way we could release it as an ASF Library but on a different
> timetable from normal Hadoop releases? There is always incubator, but this
> is such a minor project it is closer to the org.apache.hadoop.thirdparty
> library in that it is something all current committers should be able
> to commit to and release, while releasing on a schedule independent of
> hadoop releases themselves. Having it come from this project should give it
> more legitimacy.


Possible options I can think of:

   - Check it in as part of the hadoop trunk code as a separate module and
   make sure it isn't part of the normal release, as ozone & submarine did
   in the early days: they were part of the hadoop code base, but
   followed a different release cycle.
   - Get it in as a separate repository under hadoop, like
   hadoop-thirdparty and, again, as ozone & submarine were operating
   just before leaving.
   - Incubator: which you have already said no to, but the option is still
   there if all else fails.
   - Add it as a module in hadoop-thirdparty and pair the release with the
   thirdparty release, but that might not make sense because of the name
   'thirdparty', and it would still have release dependencies for you.


The easiest might be the first option; the cleanest might be the second. If
you intend to have a separate repo or something like that, you need to
set up the Jenkins jobs to run the PreCommit checks and have some
test coverage as well for the code you are checking in.

Whatever option you choose, I think it would require a formal vote; at
least the 2nd & 4th options would. I don't know how the 3rd operates. For
the 1st it is also better to have one, to prevent people coming and
shouting in the end. :-)

-Ayush

