Posted to hdfs-dev@hadoop.apache.org by Jim Clampffer <ja...@gmail.com> on 2017/10/25 19:52:21 UTC

[DISCUSS] Merging libhdfs++ (HDFS-8707) into trunk

Hi everyone,

I'd like to start a thread to discuss merging the HDFS-8707 branch, aka
libhdfs++, into trunk (as a beta release).

libhdfs++ is an HDFS client written in C++, designed to be used in
applications that are written in non-JVM-based languages.  In its current
state it supports Kerberos-authenticated reads from HDFS and has been used
in production clusters for over a year, so it has had a significant amount
of burn-in time.  The HDFS-8707 branch has been around for about 2 years
now, so I'd like to know people's thoughts on what it would take to merge
the current branch and possibly start a new one for handling writes and
encrypted reads.

Current features:
  -A libhdfs/libhdfs3 compatible C API that allows libhdfs++ to serve as a
drop-in replacement for clients that only need read support
  -An asynchronous C++ API with synchronous shims on top if the client
application wants to do blocking operations.  Internally a single thread
(optionally more) uses select/epoll by way of boost::asio to watch
thousands of sockets without the overhead of spawning threads to emulate
async operation.
  -Kerberos/SASL authentication support
  -HA namenode support
  -A set of utility programs that mirror the HDFS CLI utilities e.g.
"./hdfs dfs -chmod".  The major benefit of these is that tool startup is
~3 orders of magnitude faster (<1ms vs hundreds of ms) and they use a lot
less memory since no JVM is involved.  This makes it possible to do things
like write a simple bash script that stats a file, applies some rules to
the result, and decides whether to move it, in a way that scales to
thousands of files without being penalized with O(N) JVM startups.
  -Cancelable reads.  This has proven to be very useful in multiuser
applications that (pre)fetch large blocks of data but need to remain
responsive for interactive users.  Rather than waiting for a large and/or
slow read to finish, a canceled read returns immediately and the associated
resources (buffer, file descriptor) become available for the rest of the
application to use.
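
To make the async API, synchronous shim, and cancelable read items above
more concrete, here is a minimal sketch in standard C++.  It is NOT the
libhdfs++ API: Loop, ReadOp, AsyncRead and SyncRead are hypothetical names
invented for this illustration, the Loop class is only a stand-in for the
boost::asio event loop, and the "read" just fills a buffer with placeholder
bytes instead of touching a socket.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical single-threaded event loop (stand-in for boost::asio).
class Loop {
public:
    Loop() : worker_([this] { Run(); }) {}
    ~Loop() {
        Post([this] { done_ = true; });  // stop flag is set on the loop thread
        worker_.join();
    }
    void Post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void Run() {
        while (!done_) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return !tasks_.empty(); });
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
    bool done_ = false;   // only read and written on the loop thread
    std::thread worker_;  // declared last so it starts after the other members
};

// Handle returned to the caller so an in-flight read can be canceled.
struct ReadOp {
    std::atomic<bool> cancelled{false};
    void Cancel() { cancelled.store(true); }
};

using ReadCallback = std::function<void(bool ok, std::string data)>;

// One step of a chunked read.  Each step is its own loop task, so other
// operations can interleave and a cancel is noticed between steps.
void ReadStep(Loop& loop, std::shared_ptr<ReadOp> op, std::size_t remaining,
              std::size_t chunk, std::shared_ptr<std::string> buf,
              ReadCallback cb) {
    if (op->cancelled.load()) { cb(false, std::string()); return; }
    if (remaining == 0) { cb(true, std::move(*buf)); return; }
    const std::size_t n = remaining < chunk ? remaining : chunk;
    buf->append(n, 'x');  // placeholder for bytes arriving from a socket
    loop.Post([&loop, op, remaining, n, chunk, buf, cb] {
        ReadStep(loop, op, remaining - n, chunk, buf, cb);
    });
}

// Asynchronous entry point: returns immediately, reports via callback.
std::shared_ptr<ReadOp> AsyncRead(Loop& loop, std::size_t total,
                                  std::size_t chunk, ReadCallback cb) {
    assert(chunk > 0);
    auto op = std::make_shared<ReadOp>();
    auto buf = std::make_shared<std::string>();
    loop.Post([&loop, op, total, chunk, buf, cb] {
        ReadStep(loop, op, total, chunk, buf, cb);
    });
    return op;
}

// Synchronous shim: block the calling thread on a future that the
// asynchronous callback fulfils.
std::string SyncRead(Loop& loop, std::size_t total, std::size_t chunk) {
    auto done = std::make_shared<std::promise<std::string>>();
    std::future<std::string> result = done->get_future();
    AsyncRead(loop, total, chunk, [done](bool, std::string data) {
        done->set_value(std::move(data));
    });
    return result.get();
}
```

In this sketch a blocking caller would use SyncRead(loop, nbytes, chunk),
while an interactive application would keep the handle returned by
AsyncRead and call Cancel() on it; the callback then fires with ok == false
at the next step instead of after the whole read.  The chunked-task design
is just one simple way to get cancellation points on a single loop thread
and is an assumption of this example.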

There are a few known issues that prevent a merge of the branch as-is,
notably that it's lagging extremely far behind trunk - HDFS-12110.  There's
a patch up to get in sync but that's waiting on CI tests to be unstuck -
HDFS-12640, which I haven't been able to figure out (if anyone has tips for
investigating this I'd really appreciate it).  The other two issues that
have been raised are that headers and docs aren't being exported to the
correct places when building a distro, which will be straightforward to fix
once the rebase is done.

Thanks!

Fwd: [DISCUSS] Merging libhdfs++ (HDFS-8707) into trunk

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.

FYI: a native C++ libhdfs is coming into trunk soon

Thanks,
Roman.

---------- Forwarded message ----------
From: Jim Clampffer <ja...@gmail.com>
Date: Wed, Oct 25, 2017 at 12:52 PM
Subject: [DISCUSS] Merging libhdfs++ (HDFS-8707) into trunk
To: hdfs-dev@hadoop.apache.org
