Posted to hdfs-dev@hadoop.apache.org by Jim Clampffer <ja...@gmail.com> on 2018/02/28 17:55:10 UTC

[DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk

Hi everyone,

I'd like to start a thread to discuss merging HDFS-8707, aka libhdfs++,
into trunk.  I originally sent a similar email last October, but it
sounds like it was buried by discussions about other feature merges that
were going on at the time.

libhdfs++ is an HDFS client written in C++, designed for use in
applications written in non-JVM languages.  In its current state it
supports Kerberos-authenticated reads from HDFS and has been used in
production clusters for over a year, so it has had a significant amount
of burn-in time.  The HDFS-8707 branch has been around for about 2 years
now, so I'd like to know people's thoughts on what it would take to merge
the current branch and handle writes and encrypted reads in a new one.

Current notable features:
  -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as a
drop-in replacement for clients that only need read support (until
libhdfs++ also supports writes).
  -An asynchronous C++ API, with synchronous shims on top for client
applications that want to do blocking operations.  Internally a single
thread (optionally more) uses select/epoll by way of boost::asio to watch
thousands of sockets without the overhead of spawning threads to emulate
async operation.
  -Kerberos/SASL authentication support
  -HA namenode support
  -A set of utility programs that mirror the HDFS CLI utilities, e.g.
"./hdfs dfs -chmod".  The major benefit of these is that tool startup
is ~3 orders of magnitude faster (<1ms vs. hundreds of ms) and uses a
lot less memory, since no JVM is involved.  This makes it possible to,
for example, write a simple bash script that stats a file, applies some
rules to the result, and decides whether to move it, in a way that
scales to thousands of files without being penalized with O(N) JVM
startups.
  -Cancelable reads.  This has proven very useful in multiuser
applications that (pre)fetch large blocks of data but need to remain
responsive for interactive users.  Rather than waiting for a large
and/or slow read to finish, a canceled read returns immediately, and the
associated resources (buffer, file descriptor) become available for the
rest of the application to use.

There are a couple of known issues: the doc build isn't integrated with
the rest of Hadoop, and the public API headers aren't being exported when
building a distribution.  A short-term workaround for the missing docs is
to go through the libhdfs(3)-compatible API and use the libhdfs docs.
Other than a few modifications to the pom files to integrate the build,
the changes are isolated to a new directory, so the chance of causing any
regressions in the rest of the code is minimal.

Please share your thoughts, thanks!

Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk

Posted by Chris Douglas <cd...@apache.org>.
On Thu, Mar 1, 2018 at 10:04 AM, Jim Clampffer
<ja...@gmail.com> wrote:
> Chris, do you mean potentially landing this in its current state and
> handling some of the rough edges after?  I could see this working just
> because there's no impact on any existing code.

Yes. Better to get this committed and released than to polish it in
the branch. -C

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org


Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk

Posted by Owen O'Malley <ow...@gmail.com>.
+1 on the merge. We've been using it on the trunk of ORC for a while. It
will be great to have it released by Hadoop.

.. Owen


Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk

Posted by Vinayakumar B <vi...@apache.org>.
Definitely this would be a great addition. Kudos to everyone's contributions.

I am not a C++ expert, so I cannot vote on the code.

  ---A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as
a drop-in replacement for clients that only need read support (until libhdfs++
also supports writes).

Wouldn't it be nice to have write support as well before the merge...?
If everyone feels it's okay to have read-only support for now, I am okay anyway.


Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk

Posted by Jim Clampffer <ja...@gmail.com>.
Thanks for the feedback Chris and Kai!

Chris, do you mean potentially landing this in its current state and
handling some of the rough edges after?  I could see this working just
because there's no impact on any existing code.

With regards to your questions Kai:
There isn't a good doc for the internal architecture yet; I just reassigned
HDFS-9115 to myself to handle that.  Are there any specific areas you'd
like to know about so I can prioritize those?
Here are some header files that include a lot of comments that should help
out for now:
-hdfspp.h - the main header for the C++ API
-filesystem.h and filehandle.h - describe some rules about object
lifetimes and threading from the API point of view (most classes have
comments describing any restrictions on threading, locking, and lifecycle).
-rpc_engine.h and rpc_connection.h - begin getting into the async RPC
implementation.


1) Yes, it's a reimplementation of the entire client in C++.  Using
libhdfs3 as a reference helps a lot here but it's still a lot of work.
2) EC isn't supported now, though that'd be great to have, and I agree that
it's going to take a lot of effort to implement.  Right now if you tried
to read an EC file I think you'd get some unhelpful error out of the block
reader, but I don't have an EC-enabled cluster set up to test.  Adding an
explicit "not supported" message would be straightforward.
3) libhdfs++ reuses all of the minidfscluster tests that libhdfs already
had, so we get consistency checks on the C API.  There are a few new tests
that also run on both libhdfs and libhdfs++ and make sure the expected
output is the same too.
4) I agree, I just haven't had a chance to look into the distribution build
to see how to do it.  HDFS-9465 is tracking this.
5) Not yet (HDFS-8765).

Regards,
James





Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk

Posted by "郑锴(铁杰)" <zh...@alibaba-inc.com>.
The work sounds solid and great! + to have this.
Is there any quick doc to take a glance at? Some quick questions to be familiar with:1. Seems the client is all implemented in c++ without any Java codes (so no JVM overhead), which means lots of work, rewriting HDFS client. Right?2.  Guess erasure coding feature isn't supported, as it'd involve significant development, right? If yes, what will it say when read erasure coded file?3. Is there any building/testing mechanism to enforce the consistency between the c++ part and Java part?4. I thought the public header and lib should be exported when building the distribution package, otherwise hard to use the new C api.5. Is the short-circuit read supported?
Thanks.

Regards,
Kai
------------------------------------------------------------------
From: Chris Douglas <cd...@apache.org>
Sent: Thursday, March 1, 2018, 05:08
To: Jim Clampffer <ja...@gmail.com>
Cc: Hdfs-dev <hd...@hadoop.apache.org>
Subject: Re: [DISCUSS] Merging HDFS-8707 (C++ HDFS client) to trunk
+1

Let's get this done. We've had many false starts on a native HDFS
client. This is a good base to build on. -C

On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer
<ja...@gmail.com> wrote:
> Hi everyone,
>
> I'd like to start a thread to discuss merging the HDFS-8707 aka libhdfs++
> into trunk.  I originally sent a similar email out last October but it
> sounds like it was buried by discussions about other feature merges that
> were going on at the time.
>
> libhdfs++ is an HDFS client written in C++ designed to be used in
> applications that are written in non-JVM based languages.  In its current
> state it supports kerberos authenticated reads from HDFS and has been used
> in production clusters for over a year so it has had a significant amount
> of burn-in time.  The HDFS-8707 branch has been around for about 2 years
> now so I'd like to know people's thoughts on what it would take to merge
> the current branch, with writes and encrypted reads handled in a new one.
>
> Current notable features:
>   -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as a
> drop-in replacement for clients that only need read support (until
> libhdfs++ also supports writes).
>   -An asynchronous C++ API with synchronous shims on top if the client
> application wants to do blocking operations.  Internally a single thread
> (optionally more) uses select/epoll by way of boost::asio to watch
> thousands of sockets without the overhead of spawning threads to emulate
> async operation.
>   -Kerberos/SASL authentication support
>   -HA namenode support
>   -A set of utility programs that mirror the HDFS CLI utilities e.g.
> "./hdfs dfs -chmod".  The major benefit of these is the tool startup time
> is ~3 orders of magnitude faster (<1ms vs hundreds of ms) and occupies a
> lot less memory since it isn't dealing with the JVM.  This makes it
> possible to do things like write a simple bash script that stats a file,
> applies some rules to the result, and decides if it should move it in a way
> that scales to thousands of files without being penalized with O(N) JVM
> startups.
>   -Cancelable reads.  This has proven to be very useful in multiuser
> applications that (pre)fetch large blocks of data but need to remain
> responsive for interactive users.  Rather than waiting for a large and/or
> slow read to finish it will return immediately and the associated resources
> (buffer, file descriptor) become available for the rest of the application
> to use.
>
> There are a couple of known issues: the doc build isn't integrated with the
> rest of hadoop and the public API headers aren't being exported when
> building a distribution.  A short term solution for missing docs is to go
> through the libhdfs(3) compatible API and use the libhdfs docs.  Other than
> a few modifications to the pom files to integrate the build, the changes
> are isolated to a new directory, so the chance of causing any regressions
> in the rest of the code is minimal.
>
> Please share your thoughts, thanks!

