Posted to common-user@hadoop.apache.org by Ahmad Shahzad <as...@gmail.com> on 2010/06/10 12:25:38 UTC

Is it possible ....!!!

Hi,
    I wanted to ask whether it is possible to intercept every communication
that takes place between Hadoop's MapReduce daemons, i.e., between the
JobTracker and the TaskTracker, and make it pass through my own
communication library. So, if the JobTracker and TaskTracker talk over HTTP
or RPC, I would like to intercept the call and route it through my
communication library. If this is possible, can anyone tell me which set of
classes I need to look at in Hadoop's distribution?

Similarly, for HDFS, is it possible to route all the communication between
the NameNode and the DataNodes through my communication library?

The reason for doing this is that I want all communication to go through a
library that resolves every communication problem we can have, e.g.,
firewalls, NAT, non-routed paths, multi-homing, etc. By using that library,
all the headaches of communication would be gone, so we would be able to
use Hadoop quite easily with no communication problems.

That's my master's project, so I want to know how to start and where to
look.

I would really appreciate a reply.

Regards,
Ahmad Shahzad

Re: Is it possible ....!!! COOL!

Posted by Ryan Smith <ry...@gmail.com>.
Sounds like it could be a SPOF.

On Thu, Jun 10, 2010 at 7:47 AM, <hm...@umbc.edu> wrote:

> Hey,
>
> This is a really neat idea.... if anyone has a way to do this, could you
> share?
> I'll bet this could be very interesting! Thanks...
>
> Best,
> HAL
>
>
> > Hi,
> >     I wanted to ask whether it is possible to intercept every
> > communication that takes place between Hadoop's MapReduce daemons,
> > i.e., between the JobTracker and the TaskTracker, and make it pass
> > through my own communication library. So, if the JobTracker and
> > TaskTracker talk over HTTP or RPC, I would like to intercept the call
> > and route it through my communication library. If this is possible,
> > can anyone tell me which set of classes I need to look at in Hadoop's
> > distribution?
> >
> > Similarly, for HDFS, is it possible to route all the communication
> > between the NameNode and the DataNodes through my communication
> > library?
> >
> > The reason for doing this is that I want all communication to go
> > through a library that resolves every communication problem we can
> > have, e.g., firewalls, NAT, non-routed paths, multi-homing, etc. By
> > using that library, all the headaches of communication would be gone,
> > so we would be able to use Hadoop quite easily with no communication
> > problems.
> >
> > That's my master's project, so I want to know how to start and where
> > to look.
> >
> > I would really appreciate a reply.
> >
> > Regards,
> > Ahmad Shahzad
> >
>

Re: Is it possible ....!!! COOL!

Posted by hm...@umbc.edu.
Hey,

This is a really neat idea.... if anyone has a way to do this, could you
share?
I'll bet this could be very interesting! Thanks...

Best,
HAL


> Hi,
>     I wanted to ask whether it is possible to intercept every
> communication that takes place between Hadoop's MapReduce daemons,
> i.e., between the JobTracker and the TaskTracker, and make it pass
> through my own communication library. So, if the JobTracker and
> TaskTracker talk over HTTP or RPC, I would like to intercept the call
> and route it through my communication library. If this is possible,
> can anyone tell me which set of classes I need to look at in Hadoop's
> distribution?
>
> Similarly, for HDFS, is it possible to route all the communication
> between the NameNode and the DataNodes through my communication
> library?
>
> The reason for doing this is that I want all communication to go
> through a library that resolves every communication problem we can
> have, e.g., firewalls, NAT, non-routed paths, multi-homing, etc. By
> using that library, all the headaches of communication would be gone,
> so we would be able to use Hadoop quite easily with no communication
> problems.
>
> That's my master's project, so I want to know how to start and where
> to look.
>
> I would really appreciate a reply.
>
> Regards,
> Ahmad Shahzad
>



Re: Is it possible ....!!!

Posted by Owen O'Malley <om...@apache.org>.
You can define your own socket factory by setting the configuration parameter:

hadoop.rpc.socket.factory.class.default

to the class name of a SocketFactory. It is also possible to define
socket factories on a protocol-by-protocol basis; look at the code in
NetUtils.getSocketFactory.
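As a rough sketch, such a factory could extend javax.net.SocketFactory and hand every socket to the custom library before returning it. The class name and the intercept hook below are hypothetical; only the hadoop.rpc.socket.factory.class.default key comes from this thread:

```java
import javax.net.SocketFactory;
import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;

// Hypothetical factory: Hadoop would instantiate it reflectively once
// hadoop.rpc.socket.factory.class.default names this class.
public class InterceptingSocketFactory extends SocketFactory {

    // Central hook: wrap or replace the plain Socket with one backed
    // by the custom communication library (NAT traversal, relays, etc.).
    private Socket intercept(Socket s) {
        // ... hand the socket to the library here ...
        return s;
    }

    @Override
    public Socket createSocket() throws IOException {
        return intercept(new Socket());
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        return intercept(new Socket(host, port));
    }

    @Override
    public Socket createSocket(String host, int port,
                               InetAddress localAddr, int localPort)
            throws IOException {
        return intercept(new Socket(host, port, localAddr, localPort));
    }

    @Override
    public Socket createSocket(InetAddress addr, int port) throws IOException {
        return intercept(new Socket(addr, port));
    }

    @Override
    public Socket createSocket(InetAddress addr, int port,
                               InetAddress localAddr, int localPort)
            throws IOException {
        return intercept(new Socket(addr, port, localAddr, localPort));
    }
}
```

Note that the factory needs a no-argument constructor, since Hadoop creates it via reflection from the configured class name.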

-- Owen

Re: Is it possible ....!!!

Posted by Steve Loughran <st...@apache.org>.
Aaron Kimball wrote:
> Hadoop has some classes for controlling how sockets are used. See
> org.apache.hadoop.net.StandardSocketFactory, SocksSocketFactory.
> 
> The socket factory implementation chosen is controlled by the
> hadoop.rpc.socket.factory.class.default configuration parameter. You could
> probably write your own SocketFactory that gives back socket implementations
> that tee the conversation to another port, or to a file, etc.
> 
> So, "it's possible," but I don't know that anyone's implemented this. I
> think others may have examined Hadoop's protocols via wireshark or other
> external tools, but those don't have much insight into Hadoop's internals.
> (Neither, for that matter, would the socket factory. You'd probably need to
> be pretty clever to introspect as to exactly what type of message is being
> sent and actually do semantic analysis, etc.)

Also worry about anything opening a URL, for which there are JVM-level
factories, and about Jetty, which opens its own listeners, though
presumably it's the clients you'd want to play with.

I'm going to be honest and say this is a fairly ambitious project for a 
master's thesis, because you are going to be nestling deep into code 
across the system, possibly making changes whose benefits people who run 
well-managed datacentres won't see (they don't have connectivity 
problems, as they set up the machines and the network properly; it's 
only people like me whose home desktop is badly configured: 
https://issues.apache.org/jira/browse/HADOOP-3426 ).

Now, what might be handy is better diagnostics of the configuration,
  1. code to run on every machine to test the network, look at the 
config, play with DNS, detect problems and report them with meaningful 
errors that point to wiki pages with hints
  2. every service which opens ports to log this event somewhere 
(ideally via a service base class), so instead of trying to work out which 
ports Hadoop is using by playing with netstat -p and jps -v, you can 
make a query of the nodes (command line, signal and GET /ports) and get 
each service's list of active protocols, ports and IP addresses as text 
or JSON.
  3. some class to take that JSON list, try to access the 
various things, and log failures
  4. some MR jobs to run the code in (3) and see what happens
  5. some MR jobs whose aim in life is to measure network bandwidth and 
do stats on round-trip times.
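Item 2 above could be sketched as a small registry that each service feeds its listeners into, with the JSON dump served from GET /ports. Everything here (class name, field layout) is hypothetical, not an existing Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-node registry of open listeners. A service base
// class would call register() on every bind; a GET /ports handler or
// signal handler would return toJson().
public class PortReport {
    private final List<String> entries = new ArrayList<>();

    // Record one open listener: owning service, protocol, bind address, port.
    public void register(String service, String protocol, String ip, int port) {
        entries.add(String.format(
            "{\"service\":\"%s\",\"protocol\":\"%s\",\"ip\":\"%s\",\"port\":%d}",
            service, protocol, ip, port));
    }

    // JSON array suitable for a GET /ports response.
    public String toJson() {
        return "[" + String.join(",", entries) + "]";
    }

    public static void main(String[] args) {
        PortReport r = new PortReport();
        r.register("namenode", "ipc", "10.0.0.1", 8020);
        r.register("namenode", "http", "10.0.0.1", 50070);
        System.out.println(r.toJson());
    }
}
```

The diagnostic class in item 3 would then only need to parse this list and attempt a connection to each ip:port pair, logging the failures.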

Just a thought :)


See also some thoughts of mine on Hadoop/university collaboration
http://www.slideshare.net/steve_l/hadoop-and-universities

Re: just because you can, it doesn't mean you should....

Posted by hm...@umbc.edu.
All,

Okay, I was being facetious earlier with the 'COOL' comment.

This is a very bad idea. Well, not so much bad, but think about the
ramifications of what you are proposing. Putting a 'comm' code lib together
that facilitates comms and 'helps' with architecture issues also creates a
SPOF (as another gent pointed out); moreover, it creates a nice target
for exploitation, as the lib will undoubtedly become a repository of embedded
passwords, alternate dummy accounts, bypass routes, and all sorts of goop
to make things 'easier'. And since it has to be world-readable, and easy to
get access to, it will be very tough to protect - or easy to DoS/DDoS. Anything
and everything from random timing attacks, substitution spoofs, TOCTOUs,
you name it.

This whole thing is already a very nice open highway for distributing
embedded and tunneled 'items' of a certain unnatural nature; don't try to
override what little security you already have by 'punching holes in the
firewall' and other silly stuff.

In the long run, what might be better is a discovery agent that provides
continual validation of paths and service availability specific to Hadoop
and its sub-programs. That way any outage or problem can be immediately
addressed or brought to the attention of the SysAds/Networkers - like a
service monitoring program. Just don't make it simple for the 'hats out
there to own you in under five minutes flat (especially with an RPC or
SOAP call to some lib or flat file - and SSH/SSL abso-lu-tely does not
matter, trust me). You can disagree, and I really don't mean to be a
'buzz kill', but if you ask your local 'Sheriff', I think you'll be
advised not to pursue this path too heavily.

Have a good computational day...

Best, Hal


> Hadoop has some classes for controlling how sockets are used. See
> org.apache.hadoop.net.StandardSocketFactory, SocksSocketFactory.
>
> The socket factory implementation chosen is controlled by the
> hadoop.rpc.socket.factory.class.default configuration parameter. You
> could probably write your own SocketFactory that gives back socket
> implementations that tee the conversation to another port, or to a
> file, etc.
>
> So, "it's possible," but I don't know that anyone's implemented this. I
> think others may have examined Hadoop's protocols via wireshark or
> other external tools, but those don't have much insight into Hadoop's
> internals. (Neither, for that matter, would the socket factory. You'd
> probably need to be pretty clever to introspect as to exactly what type
> of message is being sent and actually do semantic analysis, etc.)
>
> Allen's suggestion is probably more "correct," but might incur
> additional work on your part.
>
> Cheers,
> - Aaron
>
> On Thu, Jun 10, 2010 at 3:54 PM, Allen Wittenauer
> <aw...@linkedin.com> wrote:
>
>> On Jun 10, 2010, at 3:25 AM, Ahmad Shahzad wrote:
>> > The reason for doing this is that I want all communication to go
>> > through a library that resolves every communication problem we can
>> > have, e.g., firewalls, NAT, non-routed paths, multi-homing, etc. By
>> > using that library, all the headaches of communication would be
>> > gone, so we would be able to use Hadoop quite easily with no
>> > communication problems.
>>
>> I know Owen pointed you towards using proxies, but anything remotely
>> complex would probably be better in an interposer library, as then it
>> is application agnostic.
>





Re: Is it possible ....!!!

Posted by Aaron Kimball <aa...@cloudera.com>.
Hadoop has some classes for controlling how sockets are used. See
org.apache.hadoop.net.StandardSocketFactory, SocksSocketFactory.

The socket factory implementation chosen is controlled by the
hadoop.rpc.socket.factory.class.default configuration parameter. You could
probably write your own SocketFactory that gives back socket implementations
that tee the conversation to another port, or to a file, etc.

So, "it's possible," but I don't know that anyone's implemented this. I
think others may have examined Hadoop's protocols via wireshark or other
external tools, but those don't have much insight into Hadoop's internals.
(Neither, for that matter, would the socket factory. You'd probably need to
be pretty clever to introspect as to exactly what type of message is being
sent and actually do semantic analysis, etc.)
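The "tee the conversation" idea above can be sketched with a plain
FilterOutputStream that duplicates every byte written to a second sink. This
is a generic JDK sketch, not anything that exists in Hadoop itself; a custom
SocketFactory could wrap a socket's output stream in something like it:

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Wraps an output stream (e.g. a socket's) and duplicates all traffic
// to a second stream: a capture file, a socket to another port, etc.
public class TeeOutputStream extends FilterOutputStream {
    private final OutputStream copy;

    public TeeOutputStream(OutputStream main, OutputStream copy) {
        super(main);
        this.copy = copy;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        copy.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        copy.write(b, off, len);
    }

    @Override
    public void flush() throws IOException {
        out.flush();
        copy.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream main = new ByteArrayOutputStream();
        ByteArrayOutputStream capture = new ByteArrayOutputStream();
        try (TeeOutputStream tee = new TeeOutputStream(main, capture)) {
            tee.write("hadoop rpc bytes".getBytes());
        }
        // Both sinks now hold identical bytes.
        System.out.println(main.toString().equals(capture.toString())); // prints "true"
    }
}
```

As the paragraph notes, this only copies the raw bytes; interpreting them as
Hadoop RPC messages would still require separate protocol knowledge.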

Allen's suggestion is probably more "correct," but might incur additional
work on your part.

Cheers,
- Aaron

On Thu, Jun 10, 2010 at 3:54 PM, Allen Wittenauer
<aw...@linkedin.com>wrote:

>
> On Jun 10, 2010, at 3:25 AM, Ahmad Shahzad wrote:
> > The reason for doing this is that I want all communication to go
> > through a library that resolves every communication problem we can
> > have, e.g., firewalls, NAT, non-routed paths, multi-homing, etc. By
> > using that library, all the headaches of communication would be gone,
> > so we would be able to use Hadoop quite easily with no communication
> > problems.
>
> I know Owen pointed you towards using proxies, but anything remotely
> complex would probably be better in an interposer library, as then it is
> application agnostic.

Re: Is it possible ....!!!

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Jun 10, 2010, at 3:25 AM, Ahmad Shahzad wrote:
> The reason for doing this is that I want all communication to go through
> a library that resolves every communication problem we can have, e.g.,
> firewalls, NAT, non-routed paths, multi-homing, etc. By using that
> library, all the headaches of communication would be gone, so we would be
> able to use Hadoop quite easily with no communication problems.

I know Owen pointed you towards using proxies, but anything remotely complex would probably be better in an interposer library, as then it is application agnostic.