Posted to common-user@hadoop.apache.org by David Ritch <da...@gmail.com> on 2008/11/13 21:19:37 UTC

Web Proxy to Access DataNodes

Has anyone configured Apache as a reverse proxy to allow access to your
cloud?  I'm having trouble doing this.

I have a cloud.  My datanodes are not visible outside the cloud, for
security reasons.  I'd like to give my developers some degree of access
through a proxy.  I have Apache 2.2 and can get part of the way there, but
I'm having trouble with the proxy rules and ports.

I'm running a name node on NN, and Apache on WS.  I'd like to be able to go
to http://WS/NN, and have that map to http://NN:50070/, and show me the
NameNode status.

I tried the naive configuration of Apache:

ProxyPass /NN http://NN:50070
ProxyPassReverse /NN http://NN:50070

This maps the first page properly, and gets the redirect to dfshealth.
However, it does not do the reverse remap, and I get
http://WS/dfshealth.jsp.

So, I tried the following:

ProxyPass /NN http://NN
ProxyPassReverse /NN http://NN

and accessed http://WS/NN:50070/ in my browser.  Again, the initial touch
resulted in a successful proxy operation.  This time, the rewrite of the
reference succeeded, and my browser tried to access
http://WS/NN:50070/dfshealth.jsp; however, this did not result in a proxy
operation.  Instead, Apache tried to serve a local file,
/var/www/html/NN:50070, apparently treating that as a directory and
dfshealth.jsp as a file inside it.

So - what am I doing wrong?  Is there a reason that I should be using an
entirely different approach?

Thanks!

David
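
For what it's worth, two things commonly bite in exactly this way:
ProxyPassReverse only rewrites redirect headers whose URL literally matches
the configured backend (so a NameNode that redirects using a bare path, or a
different form of its hostname, slips through), and links inside the
returned HTML are never rewritten by mod_proxy at all.  A minimal sketch of
the usual fix, assuming the third-party mod_proxy_html module is available
for this Apache 2.2 build (NN stays the placeholder hostname from above):

# Trailing slashes keep mod_proxy's prefix matching clean.
ProxyPass        /NN/ http://NN:50070/
ProxyPassReverse /NN/ http://NN:50070/

# ProxyPassReverse never touches HTML bodies, so in-page links such
# as /dfshealth.jsp are rewritten by mod_proxy_html instead:
<Location /NN/>
    ProxyHTMLEnable On
    ProxyHTMLURLMap / /NN/
</Location>

Each DataNode's web port (50075 by default) would still need a rule of its
own, which is the maintenance burden the SOCKS suggestion later in this
thread avoids.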

Re: Web Proxy to Access DataNodes

Posted by "David B. Ritch" <da...@gmail.com>.
Thanks, Karl.  I'll take a look at that script.

David

Karl Anderson wrote:
>
> On 13-Nov-08, at 8:44 PM, David Ritch wrote:
>
>>
>> On Thu, Nov 13, 2008 at 7:32 PM, Alex Loddengaard <al...@cloudera.com>
>> wrote:
>>
>>> You could also have your developers set up a SOCKS proxy with the -D
>>> option to ssh.  Then have them install FoxyProxy.
>
> hadoop-ec2 has a utility to make this easy:
>
> src/contrib/ec2/bin/hadoop-ec2 proxy <clustername>
>
> Check out the source if you need a hint for non-EC2; it's a shell script.
>
> In the wiki there are instructions on how to use FoxyProxy with this.  I
> fire up a few clusters a day, and with this I can get to every link in
> the web interface with no setup needed.
>
> Karl Anderson
> kra@monkey.org
> http://monkey.org/~kra
>
>
>
>


Re: Web Proxy to Access DataNodes

Posted by Karl Anderson <kr...@monkey.org>.
On 13-Nov-08, at 8:44 PM, David Ritch wrote:

>
> On Thu, Nov 13, 2008 at 7:32 PM, Alex Loddengaard  
> <al...@cloudera.com> wrote:
>
>> You could also have your developers set up a SOCKS proxy with the -D
>> option to ssh.  Then have them install FoxyProxy.

hadoop-ec2 has a utility to make this easy:

src/contrib/ec2/bin/hadoop-ec2 proxy <clustername>

Check out the source if you need a hint for non-EC2; it's a shell script.

In the wiki there are instructions on how to use FoxyProxy with this.  I
fire up a few clusters a day, and with this I can get to every link in  
the web interface with no setup needed.

Karl Anderson
kra@monkey.org
http://monkey.org/~kra
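
For a non-EC2 cluster, the core of what that proxy command does can be
sketched as follows (a rough sketch under assumptions, not an excerpt from
the script; the gateway host, user, and local port are placeholders):

# Open a dynamic (SOCKS) tunnel on local port 1080 through a host
# that can reach the cluster's private network; -N carries the
# tunnel without running a remote command.
ssh -N -D 1080 user@gateway.example.com

A browser pointed at localhost:1080 as a SOCKS proxy (FoxyProxy makes the
switching painless) can then reach the NameNode and DataNode web pages by
their internal names.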




Re: Web Proxy to Access DataNodes

Posted by David Ritch <da...@gmail.com>.
Thanks, Alex - I'll take a look at SOCKS for this.  However, the latest
versions of Apache also have regexp-based proxying.  On the other hand, I
haven't quite gotten that to work either.

David

On Thu, Nov 13, 2008 at 7:32 PM, Alex Loddengaard <al...@cloudera.com> wrote:

> You could also have your developers set up a SOCKS proxy with the -D option
> to ssh.  Then have them install FoxyProxy.
>
> The approach you're taking will make maintaining access to your
> datanodes difficult.  That is, for each new datanode, you'll have to add a
> proxy rule to Apache.  With the SOCKS setup, FoxyProxy can be configured to
> use regular expressions, so requests to any one of your nodes get proxied.
>
> Googling for "SOCKS proxy ssh" should give you more info on how to get this
> up and running.  We used this setup for a hack contest we hosted at
> ApacheCon and it worked well.
>
> Alex
>
> On Thu, Nov 13, 2008 at 12:19 PM, David Ritch <da...@gmail.com>
> wrote:
>
> > Has anyone configured Apache as a reverse proxy to allow access to your
> > cloud?  I'm having trouble doing this.
> >
> > I have a cloud.  My datanodes are not visible outside the cloud, for
> > security reasons.  I'd like to give my developers some degree of access
> > through a proxy.  I have Apache 2.2 and can get part of the way there, but
> > I'm having trouble with the proxy rules and ports.
> >
> > I'm running a name node on NN, and Apache on WS.  I'd like to be able to go
> > to http://WS/NN, and have that map to http://NN:50070/, and show me the
> > NameNode status.
> >
> > I tried the naive configuration of Apache:
> >
> > ProxyPass /NN http://NN:50070
> > ProxyPassReverse /NN http://NN:50070
> >
> > This maps the first page properly, and gets the redirect to dfshealth.
> > However, it does not do the reverse remap, and I get
> > http://WS/dfshealth.jsp.
> >
> > So, I tried the following:
> >
> > ProxyPass /NN http://NN
> > ProxyPassReverse /NN http://NN
> >
> > and accessed http://WS/NN:50070/ in my browser.  Again, the initial touch
> > resulted in a successful proxy operation.  This time, the rewrite of the
> > reference succeeded, and my browser tried to access
> > http://WS/NN:50070/dfshealth.jsp; however, this did not result in a proxy
> > operation.  Instead, Apache tried to serve a local file,
> > /var/www/html/NN:50070, apparently treating that as a directory and
> > dfshealth.jsp as a file inside it.
> >
> > So - what am I doing wrong?  Is there a reason that I should be using an
> > entirely different approach?
> >
> > Thanks!
> >
> > David
> >
>

Re: Any suggestion on performance improvement ?

Posted by Aaron Kimball <aa...@cloudera.com>.
It's worth pointing out that Hadoop really isn't designed to run at such a
small scale.  Hadoop's performance doesn't really begin to kick in until
you've got tens of GBs of data.

The question is sort of like asking "how can I make an 18-wheeler run faster
when carrying only a single bag of groceries."

There is a large amount of overhead associated with starting Hadoop; in
particular, starting a bunch of JVMs. The TaskTrackers only poll for new
work every 10 seconds, so every Hadoop job takes at least 10 seconds before
it even gets slotted onto worker nodes.  The remaining 20 seconds or so is
most likely eaten up by similar overheads.
These stop being a factor when you actually have a sizeable amount of data
worth reading.

You are correct that adding more nodes won't help. For 60 MB of data, it's
only spawning one task, on one worker node.

You might want to configure Hadoop to run in single-threaded mode on a
single machine, and ditch the cluster entirely. Set 'mapred.job.tracker' to
'local' and 'fs.default.name' to 'file:///some/dir/in/the/local/machine',
and it should run Hadoop entirely within a single JVM.
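
In hadoop-site.xml that looks roughly like this (a sketch; the property
names are the ones given above, but the local path is a made-up example):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- "local" runs MapReduce in-process, with no JobTracker -->
    <value>local</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <!-- use the local filesystem instead of HDFS -->
    <value>file:///home/user/hadoop-local</value>
  </property>
</configuration>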

- Aaron

On Fri, Nov 14, 2008 at 11:12 AM, souravm <SO...@infosys.com> wrote:

> Hi Alex,
>
> I get 30-40 secs of response time for around 60 MB of data. The number of
> map and reduce tasks is 1 each. This is because the default HDFS block size
> is 64 MB and Pig assigns 1 map task for each HDFS block - I believe that is
> optimal.
>
> This being the unit of performance, I don't think increasing the number of
> nodes would make it any better.
>
> Regards,
> Sourav
> -----Original Message-----
> From: Alex Loddengaard [mailto:alex@cloudera.com]
> Sent: Friday, November 14, 2008 9:44 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Any suggestion on performance improvement ?
>
> How big is the data that you're loading and filtering?  Your cluster is
> pretty small, so if you have data on the order of tens or hundreds of
> GBs, then the performance you're describing is probably to be expected.
> How many map and reduce tasks are you running on each node?
>
> Alex
>
> On Thu, Nov 13, 2008 at 4:55 PM, souravm <SO...@infosys.com> wrote:
>
> > Hi,
> >
> > I'm testing with a 4-node setup of Hadoop HDFS.
> >
> > Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of
> > disk space.
> >
> > I've kept files of different sizes in HDFS, ranging from 10 MB to 5 GB.
> >
> > I'm querying those files using Pig. What I'm seeing is that even a simple
> > select query (LOAD and FILTER) takes at least 30-40 sec. The map
> > process on one node takes at least 25 sec.
> >
> > I've set the JVM max heap size to 1024m.
> >
> > Any suggestion on how to improve the performance with different
> > configuration at the Hadoop level (by changing HDFS and MapReduce
> > parameters)?
> >
> > Regards,
> > Sourav
> >

RE: Any suggestion on performance improvement ?

Posted by souravm <SO...@infosys.com>.
Hi Alex,

I get 30-40 secs of response time for around 60 MB of data. The number of map and reduce tasks is 1 each. This is because the default HDFS block size is 64 MB and Pig assigns 1 map task for each HDFS block - I believe that is optimal.

This being the unit of performance, I don't think increasing the number of nodes would make it any better.

Regards,
Sourav
-----Original Message-----
From: Alex Loddengaard [mailto:alex@cloudera.com] 
Sent: Friday, November 14, 2008 9:44 AM
To: core-user@hadoop.apache.org
Subject: Re: Any suggestion on performance improvement ?

How big is the data that you're loading and filtering?  Your cluster is
pretty small, so if you have data on the order of tens or hundreds of
GBs, then the performance you're describing is probably to be expected.
How many map and reduce tasks are you running on each node?

Alex

On Thu, Nov 13, 2008 at 4:55 PM, souravm <SO...@infosys.com> wrote:

> Hi,
>
> I'm testing with a 4-node setup of Hadoop HDFS.
>
> Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of
> disk space.
>
> I've kept files of different sizes in HDFS, ranging from 10 MB to 5 GB.
>
> I'm querying those files using Pig. What I'm seeing is that even a simple
> select query (LOAD and FILTER) takes at least 30-40 sec. The map
> process on one node takes at least 25 sec.
>
> I've set the JVM max heap size to 1024m.
>
> Any suggestion on how to improve the performance with different
> configuration at the Hadoop level (by changing HDFS and MapReduce
> parameters)?
>
> Regards,
> Sourav
>

Re: Any suggestion on performance improvement ?

Posted by Alex Loddengaard <al...@cloudera.com>.
How big is the data that you're loading and filtering?  Your cluster is
pretty small, so if you have data on the order of tens or hundreds of
GBs, then the performance you're describing is probably to be expected.
How many map and reduce tasks are you running on each node?

Alex

On Thu, Nov 13, 2008 at 4:55 PM, souravm <SO...@infosys.com> wrote:

> Hi,
>
> I'm testing with a 4-node setup of Hadoop HDFS.
>
> Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of
> disk space.
>
> I've kept files of different sizes in HDFS, ranging from 10 MB to 5 GB.
>
> I'm querying those files using Pig. What I'm seeing is that even a simple
> select query (LOAD and FILTER) takes at least 30-40 sec. The map
> process on one node takes at least 25 sec.
>
> I've set the JVM max heap size to 1024m.
>
> Any suggestion on how to improve the performance with different
> configuration at the Hadoop level (by changing HDFS and MapReduce
> parameters)?
>
> Regards,
> Sourav
>

Any suggestion on performance improvement ?

Posted by souravm <SO...@infosys.com>.
Hi,

I'm testing with a 4-node setup of Hadoop HDFS.

Each node has 2 GB of memory, a dual-core CPU, and around 30-60 GB of disk space.

I've kept files of different sizes in HDFS, ranging from 10 MB to 5 GB.

I'm querying those files using Pig. What I'm seeing is that even a simple select query (LOAD and FILTER) takes at least 30-40 sec. The map process on one node takes at least 25 sec.

I've set the JVM max heap size to 1024m.

Any suggestion on how to improve the performance with different configuration at the Hadoop level (by changing HDFS and MapReduce parameters)?

Regards,
Sourav
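
For reference, the shape of the query being described here, as a minimal
Pig Latin sketch (the path, field names, and filter condition are made up
for illustration):

-- load a tab-separated file from HDFS and keep only the matching rows
records  = LOAD '/data/sample.txt' USING PigStorage('\t')
           AS (id:int, name:chararray);
filtered = FILTER records BY id > 100;
DUMP filtered;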


Re: Web Proxy to Access DataNodes

Posted by Alex Loddengaard <al...@cloudera.com>.
You could also have your developers set up a SOCKS proxy with the -D option
to ssh.  Then have them install FoxyProxy.

The approach you're taking will make maintaining access to your
datanodes difficult.  That is, for each new datanode, you'll have to add a
proxy rule to Apache.  With the SOCKS setup, FoxyProxy can be configured to
use regular expressions, so requests to any one of your nodes get proxied.

Googling for "SOCKS proxy ssh" should give you more info on how to get this
up and running.  We used this setup for a hack contest we hosted at
ApacheCon and it worked well.

Alex

On Thu, Nov 13, 2008 at 12:19 PM, David Ritch <da...@gmail.com> wrote:

> Has anyone configured Apache as a reverse proxy to allow access to your
> cloud?  I'm having trouble doing this.
>
> I have a cloud.  My datanodes are not visible outside the cloud, for
> security reasons.  I'd like to give my developers some degree of access
> through a proxy.  I have Apache 2.2 and can get part of the way there, but
> I'm having trouble with the proxy rules and ports.
>
> I'm running a name node on NN, and Apache on WS.  I'd like to be able to go
> to http://WS/NN, and have that map to http://NN:50070/, and show me the
> NameNode status.
>
> I tried the naive configuration of Apache:
>
> ProxyPass /NN http://NN:50070
> ProxyPassReverse /NN http://NN:50070
>
> This maps the first page properly, and gets the redirect to dfshealth.
> However, it does not do the reverse remap, and I get
> http://WS/dfshealth.jsp.
>
> So, I tried the following:
>
> ProxyPass /NN http://NN
> ProxyPassReverse /NN http://NN
>
> and accessed http://WS/NN:50070/ in my browser.  Again, the initial touch
> resulted in a successful proxy operation.  This time, the rewrite of the
> reference succeeded, and my browser tried to access
> http://WS/NN:50070/dfshealth.jsp; however, this did not result in a proxy
> operation.  Instead, Apache tried to serve a local file,
> /var/www/html/NN:50070, apparently treating that as a directory and
> dfshealth.jsp as a file inside it.
>
> So - what am I doing wrong?  Is there a reason that I should be using an
> entirely different approach?
>
> Thanks!
>
> David
>