Posted to hdfs-dev@hadoop.apache.org by Haohui Mai <hm...@hortonworks.com> on 2014/04/02 23:27:33 UTC

Putting the hdfs client as a separate jar

Hi,

Many downstream projects need the ability to access hdfs. To do this,
downstream projects are currently forced to bring in the whole hdfs jar and
its dependencies, since both the hdfs server and the hdfs client reside in
the same jar.

To integrate with hdfs, the downstream projects are forced to manage the
excess dependencies from the hdfs server side (e.g., jersey, servlet, netty,
and jsp-runtime, just to name a few). In my own experience, I ended up
spending quite a bit of time tweaking the POMs to work around collisions
between dependent jars.

To solve this problem, I propose reorganizing the code to put the hdfs
client into a separate jar. That way the client jar no longer depends on the
jars that are required by the server side, which makes it easier for
downstream projects to integrate with hdfs.

Your feedback is appreciated.

Regards,
Haohui

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Putting the hdfs client as a separate jar

Posted by Haohui Mai <hm...@hortonworks.com>.
Tuning the POM only mitigates the problem. The problem with a single HDFS
jar is that you can't rule out all unnecessary dependencies. For example,
NamenodeWebHdfsMethods depends on jersey-server and servlet. The Apache
Falcon project has clients for HDFS, Hive, Pig, and Oozie, so it pulls in
those dependencies. It also pulls in Tomcat.

Here is the bad part: the dependencies from the client and Tomcat result in
three different versions of the servlet jars on the classpath. They have the
same package name but they obviously behave differently. We have to be
extremely careful about the order of the jars on the classpath to make
things work.
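
As a rough illustration of how such a collision can be spotted, here is a
minimal sketch. It is not part of Hadoop or Falcon; the class name
ClasspathCopies and the default resource name are assumptions made for the
example. It asks the system class loader for every location that provides a
given class file, so more than one result means several jars ship the same
class and classpath order decides which one wins.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Hadoop class): lists every location visible to
// the system class loader that provides a given class file.
class ClasspathCopies {

    // resourcePath uses slashes, e.g. "javax/servlet/Servlet.class".
    static List<URL> findCopies(String resourcePath) {
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        try {
            // getResources enumerates all matching resources, not just the first.
            return Collections.list(loader.getResources(resourcePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The default resource is only an example; pass another path as an argument.
        String resource = args.length > 0 ? args[0] : "javax/servlet/Servlet.class";
        List<URL> copies = findCopies(resource);
        System.out.println(copies.size() + " location(s) provide " + resource);
        for (URL url : copies) {
            System.out.println("  " + url);
        }
    }
}
```

Running it with the application's real classpath shows which jars provide
the class; it does not say which version is the right one.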

One might be able to hand-craft an hdfs-client POM to do the trick, but it
requires a lot of tuning and testing, and sometimes it might not be
feasible because of the dependencies between the classes in the jar.

Pulling out the client to another module seems much cleaner and easier to
maintain.

~Haohui



On Thu, Apr 3, 2014 at 5:06 AM, Steve Loughran <st...@hortonworks.com> wrote:

> to follow up with an example,
>
> JIRA on updating dependencies and tuning the POMs
> https://issues.apache.org/jira/browse/HADOOP-9991
>
>
>  here's a JIRA on dropping ZK from the hadoop-client POM
>
> https://issues.apache.org/jira/browse/HADOOP-9905
> 
> And there's an mr-client POM where we've been slowly cutting down on what
> it pulls in
> https://issues.apache.org/jira/browse/MAPREDUCE-5624
>
> This shows that
> 1. we can given maven/ivy projects what they need -and no more- through
> POM-only projects.
> 2. its an ongoing project to keep those dependencies cut down.
> 3. there's always the risk that you drop too much and some project
> discovers that while their code builds, it doesn't run any more.
>

Re: Putting the hdfs client as a separate jar

Posted by Haohui Mai <hm...@hortonworks.com>.
I filed HDFS-6200 to demonstrate the feasibility of the approach.

~Haohui


On Fri, Apr 4, 2014 at 11:46 AM, Haohui Mai <hm...@hortonworks.com> wrote:

> I agree with Nicholas, Steve and Alejandro that it might require some
> nontrivial to achieve the goal. Here is my high-level plan:
>
> 1. Create a new hdfs-client package, and gradually move classes from hdfs
> to hdfs-client. Fortunately IDEs like Eclipse and IntelliJ can do most of
> the heavy-liftings.
> 2. The hdfs package depends on the hdfs-client, because the webhdfs server
> uses a DFSClient internally.
> 3. Downstream projects can continue to depend on hdfs in the meantime, but
> they can start using the new hdfs-client projects once we make enough
> progress.
>
> That way the work can be done incrementally. Tackling the dependency of
> hadoop-common requires more work thus I plan to take care of it after we
> have a separate hdfs-client. Does it sound a reasonable plan?
>
> ~Haohui
>
>
> On Thu, Apr 3, 2014 at 10:16 AM, Alejandro Abdelnur <tu...@cloudera.com>wrote:
>
>> Haouhi's suggestion of a hdfs-client JAR with client dependencies only,
>> would be IMO the 'correct' way of doing things, we should have a
>> hdfs-server and hdfs-client JARs.
>>
>> Doing this is practice is not trivial as classes are not properly
>> segregated. So, Steven's suggestion of  an hdfs-client seems the best bet
>> short term.
>>
>> thx
>>
>>
>> On Thu, Apr 3, 2014 at 5:06 AM, Steve Loughran <stevel@hortonworks.com
>> >wrote:
>>
>> > to follow up with an example,
>> >
>> > JIRA on updating dependencies and tuning the POMs
>> > https://issues.apache.org/jira/browse/HADOOP-9991
>> >
>> >
>> >  here's a JIRA on dropping ZK from the hadoop-client POM
>> >
>> > https://issues.apache.org/jira/browse/HADOOP-9905
>> >
>> > And there's an mr-client POM where we've been slowly cutting down on
>> what
>> > it pulls in
>> > https://issues.apache.org/jira/browse/MAPREDUCE-5624
>> >
>> > This shows that
>> > 1. we can given maven/ivy projects what they need -and no more- through
>> > POM-only projects.
>> > 2. its an ongoing project to keep those dependencies cut down.
>> > 3. there's always the risk that you drop too much and some project
>> > discovers that while their code builds, it doesn't run any more.
>> >
>>
>>
>>
>> --
>> Alejandro
>>
>
>

Re: Putting the hdfs client as a separate jar

Posted by Haohui Mai <hm...@hortonworks.com>.
I agree with Nicholas, Steve and Alejandro that it might require some
nontrivial work to achieve the goal. Here is my high-level plan:

1. Create a new hdfs-client package, and gradually move classes from hdfs
to hdfs-client. Fortunately IDEs like Eclipse and IntelliJ can do most of
the heavy lifting.
2. The hdfs package depends on hdfs-client, because the webhdfs server
uses a DFSClient internally.
3. Downstream projects can continue to depend on hdfs in the meantime, but
they can start using the new hdfs-client package once we make enough
progress.

That way the work can be done incrementally. Tackling the dependencies of
hadoop-common requires more work, so I plan to take care of that after we
have a separate hdfs-client. Does that sound like a reasonable plan?

~Haohui


On Thu, Apr 3, 2014 at 10:16 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:

> Haouhi's suggestion of a hdfs-client JAR with client dependencies only,
> would be IMO the 'correct' way of doing things, we should have a
> hdfs-server and hdfs-client JARs.
>
> Doing this is practice is not trivial as classes are not properly
> segregated. So, Steven's suggestion of  an hdfs-client seems the best bet
> short term.
>
> thx
>
>
> On Thu, Apr 3, 2014 at 5:06 AM, Steve Loughran <stevel@hortonworks.com
> >wrote:
>
> > to follow up with an example,
> >
> > JIRA on updating dependencies and tuning the POMs
> > https://issues.apache.org/jira/browse/HADOOP-9991
> >
> >
> >  here's a JIRA on dropping ZK from the hadoop-client POM
> >
> > https://issues.apache.org/jira/browse/HADOOP-9905
> >
> > And there's an mr-client POM where we've been slowly cutting down on what
> > it pulls in
> > https://issues.apache.org/jira/browse/MAPREDUCE-5624
> >
> > This shows that
> > 1. we can given maven/ivy projects what they need -and no more- through
> > POM-only projects.
> > 2. its an ongoing project to keep those dependencies cut down.
> > 3. there's always the risk that you drop too much and some project
> > discovers that while their code builds, it doesn't run any more.
> >
>
>
>
> --
> Alejandro
>

Re: Putting the hdfs client as a separate jar

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
Haohui's suggestion of an hdfs-client JAR with client dependencies only
would be IMO the 'correct' way of doing things; we should have
hdfs-server and hdfs-client JARs.

Doing this in practice is not trivial, as the classes are not properly
segregated. So Steve's suggestion of an hdfs-client POM seems the best bet
in the short term.

thx


On Thu, Apr 3, 2014 at 5:06 AM, Steve Loughran <st...@hortonworks.com> wrote:

> to follow up with an example,
>
> JIRA on updating dependencies and tuning the POMs
> https://issues.apache.org/jira/browse/HADOOP-9991
>
>
>  here's a JIRA on dropping ZK from the hadoop-client POM
>
> https://issues.apache.org/jira/browse/HADOOP-9905
>
> And there's an mr-client POM where we've been slowly cutting down on what
> it pulls in
> https://issues.apache.org/jira/browse/MAPREDUCE-5624
>
> This shows that
> 1. we can given maven/ivy projects what they need -and no more- through
> POM-only projects.
> 2. its an ongoing project to keep those dependencies cut down.
> 3. there's always the risk that you drop too much and some project
> discovers that while their code builds, it doesn't run any more.
>



-- 
Alejandro

Re: Putting the hdfs client as a separate jar

Posted by Steve Loughran <st...@hortonworks.com>.
to follow up with an example,

JIRA on updating dependencies and tuning the POMs
https://issues.apache.org/jira/browse/HADOOP-9991


Here's a JIRA on dropping ZK from the hadoop-client POM
https://issues.apache.org/jira/browse/HADOOP-9905

And there's an mr-client POM where we've been slowly cutting down on what
it pulls in
https://issues.apache.org/jira/browse/MAPREDUCE-5624

This shows that
1. we can give maven/ivy projects what they need, and no more, through
POM-only projects.
2. it's an ongoing project to keep those dependencies cut down.
3. there's always the risk that you drop too much and some project
discovers that while their code builds, it doesn't run any more.
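
The risk in point 3 can be caught early with a small runtime smoke check.
The sketch below is only an assumption about how a downstream project might
do this; RuntimeDependencyCheck is a hypothetical name, and the class names
listed in main are just examples of classes an HDFS client typically needs.
It tries to load each required class by name and reports the ones the
runtime classpath cannot provide, which the compiler alone would not notice
for classes that are only loaded reflectively or transitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical smoke check (not a Hadoop class): reports which of the given
// classes cannot be loaded from the current runtime classpath.
class RuntimeDependencyCheck {

    static List<String> missingClasses(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = RuntimeDependencyCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: load the class without running its static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                // Either the class itself or something it links against is absent.
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Example names only; a real project would list what it needs at run time.
        List<String> required = Arrays.asList(
                "org.apache.hadoop.fs.FileSystem",
                "org.apache.hadoop.hdfs.DistributedFileSystem");
        List<String> missing = missingClasses(required);
        if (missing.isEmpty()) {
            System.out.println("All required classes are loadable.");
        } else {
            System.out.println("Missing at run time: " + missing);
        }
    }
}
```

Run against the trimmed-down dependency set, a non-empty result is a hint
that an exclusion dropped too much.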

Re: Putting the hdfs client as a separate jar

Posted by Steve Loughran <st...@hortonworks.com>.
On 3 April 2014 00:02, Haohui Mai <hm...@hortonworks.com> wrote:

> The rpc and the web client can stay in one jar for the first cut. Indeed it
> might introduce some extra dependency, but the downstream projects always
> have the option to implement the webhdfs protocol themselves if they really
> need to avoid the dependency.
>
> Hadoop common is a bigger problem. Indeed the hadoop common jar needs to be
> separated into smaller modules to minimize the dependency. This needs to be
> addressed as well.
>

-1

It's not needed, and having >1 JAR only introduces a new problem:
"inconsistent versions of tightly coupled libraries on the classpath".

Re: Putting the hdfs client as a separate jar

Posted by Haohui Mai <hm...@hortonworks.com>.
The rpc and the web client can stay in one jar for the first cut. Indeed it
might introduce some extra dependencies, but the downstream projects always
have the option to implement the webhdfs protocol themselves if they really
need to avoid those dependencies.

Hadoop common is a bigger problem. Indeed the hadoop common jar needs to be
separated into smaller modules to minimize its dependencies. This needs to
be addressed as well.

The good news is that it can be done in an incremental way. We can revisit
the dependencies of hadoop-common after separating out the hdfs client jar.

~Haohui


On Wed, Apr 2, 2014 at 2:48 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:

> It is a very good idea although it might not be easy to do.  One aspect to
> consider is that do we need separated jars for rpc client and web client?
>  Now, suppose we could successfully separate HFDS Client jar(s) from HDFS.
>  However, HDFS Client uses Common as a library.  We have to
> separate Common since it also has a lot of dependent jars.  I guess we
> might have to divide Common and HDFS into small modules and figure out the
> dependency between them.
>
> Tsz-Wo
> On Wednesday, April 2, 2014 2:28 PM, Haohui Mai <hm...@hortonworks.com>
> wrote:
>
> Hi,
> >
> >Many downstream projects needs the ability to access hdfs. In order to do
> >this, currently downstream projects are forced to bring in the whole hdfs
> >jar and its dependency, since both the hdfs server and the hdfs client
> >reside in the same jar.
> >
> >To integrate with hdfs, the downstream projects are forced to manage the
> >excess dependency from the hdfs server side (e.g., jersey, servlet, netty,
> >and jsp-runtime, just to name a few). In my own experience, I ended up
> >spending quite a bit of time on tweaking the POMs to work around
> collisions
> >of dependent jars.
> >
> >To solve this problem, I propose to reorganize the code to put hdfs client
> >into a separate jar. That way the client jar no longer depends on the jars
> >that are required by the server side, therefore it is easier for the
> >downstream projects to integrate with hdfs.
> >
> >Your feedbacks are appreciated.
> >
> >Regards,
> >Haohui
> >

Re: Putting the hdfs client as a separate jar

Posted by Tsz Wo Sze <sz...@yahoo.com>.
It is a very good idea, although it might not be easy to do. One aspect to
consider is whether we need separate jars for the rpc client and the web
client. Now, suppose we could successfully separate the HDFS Client jar(s)
from HDFS. However, HDFS Client uses Common as a library. We have to
separate Common as well, since it also has a lot of dependent jars. I guess
we might have to divide Common and HDFS into small modules and figure out
the dependencies between them.

Tsz-Wo
On Wednesday, April 2, 2014 2:28 PM, Haohui Mai <hm...@hortonworks.com> wrote:
 
Hi,
>
>Many downstream projects needs the ability to access hdfs. In order to do
>this, currently downstream projects are forced to bring in the whole hdfs
>jar and its dependency, since both the hdfs server and the hdfs client
>reside in the same jar.
>
>To integrate with hdfs, the downstream projects are forced to manage the
>excess dependency from the hdfs server side (e.g., jersey, servlet, netty,
>and jsp-runtime, just to name a few). In my own experience, I ended up
>spending quite a bit of time on tweaking the POMs to work around collisions
>of dependent jars.
>
>To solve this problem, I propose to reorganize the code to put hdfs client
>into a separate jar. That way the client jar no longer depends on the jars
>that are required by the server side, therefore it is easier for the
>downstream projects to integrate with hdfs.
>
>Your feedbacks are appreciated.
>
>Regards,
>Haohui
>

Re: Putting the hdfs client as a separate jar

Posted by Steve Loughran <st...@hortonworks.com>.
It's not an issue with the hdfs/hadoop JARs themselves, but with the POMs,
and the same problem exists with the hadoop core JAR: too much stuff you
don't need client side.

We can address this without changing the packaging into an hdfs-client.jar
(and so complicating everything related to HDFS code).

All we need to do is create an hdfs-client POM which you add as a
dependency when you want the client and nothing server side.





On 2 April 2014 23:27, Haohui Mai <hm...@hortonworks.com> wrote:

> Hi,
>
> Many downstream projects needs the ability to access hdfs. In order to do
> this, currently downstream projects are forced to bring in the whole hdfs
> jar and its dependency, since both the hdfs server and the hdfs client
> reside in the same jar.
>
> To integrate with hdfs, the downstream projects are forced to manage the
> excess dependency from the hdfs server side (e.g., jersey, servlet, netty,
> and jsp-runtime, just to name a few). In my own experience, I ended up
> spending quite a bit of time on tweaking the POMs to work around collisions
> of dependent jars.
>
> To solve this problem, I propose to reorganize the code to put hdfs client
> into a separate jar. That way the client jar no longer depends on the jars
> that are required by the server side, therefore it is easier for the
> downstream projects to integrate with hdfs.
>
> Your feedbacks are appreciated.
>
> Regards,
> Haohui
>
