Posted to common-user@hadoop.apache.org by Remy Dubois <rd...@talend.com> on 2015/04/03 14:56:04 UTC

Hadoop and HttpFs

Hi everyone,

I've been thinking about the constraint that a Hadoop client has to know, and have network access to, every single datanode in order to read from or write to HDFS. What happens when strict security policies sit on top of the cluster?
I found HttpFS (and WebHDFS), which lets a client talk to a single machine to do what I'm looking for. Plain HDFS operations do indeed work fine that way.
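
For concreteness, a minimal sketch of that kind of single-endpoint access through the Hadoop FileSystem API; the gateway hostname and port below are placeholders (14000 is the default HttpFS port, the NameNode WebHDFS endpoint would typically be on 50070 in Hadoop 2.x):

import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WebHdfsAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder gateway host and port for the HttpFS/WebHDFS endpoint.
        FileSystem fs = FileSystem.get(URI.create("webhdfs://gateway-host:14000"), conf);

        // Write a small file through the single HTTP endpoint; when going through
        // HttpFS the client never talks to the datanodes directly.
        try (OutputStream out = fs.create(new Path("/tmp/httpfs-test.txt"))) {
            out.write("hello via webhdfs\n".getBytes("UTF-8"));
        }
        System.out.println("exists: " + fs.exists(new Path("/tmp/httpfs-test.txt")));
        fs.close();
    }
}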

Then I tried to run a Pig job the same way (Pig 0.12 on top of Hadoop 2.3.0). Here the FileContext and AbstractFileSystem classes don't allow any FileSystem other than hdfs and local, so webhdfs is rejected.
That isn't a problem until you need to register a jar in your Pig application. For the Load and the Store, prefixing their paths with the webhdfs:// scheme works. But when you register a jar, PigServer reuses the initial configuration (the one with hdfs://) to ship the jars to the distributed cache, and at that point it fails because the client has no access to the datanodes.
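
For reference, a rough sketch of the kind of driver code in question; the gateway host, port, and jar path are placeholders, and forcing fs.defaultFS onto webhdfs is exactly the step that the FileContext/AbstractFileSystem restriction appears to block:

import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigOverWebHdfs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Attempt to make jar shipping also go through the HTTP gateway
        // (placeholder host/port; this is the part that currently fails).
        props.setProperty("fs.defaultFS", "webhdfs://gateway-host:14000");

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);

        // Registering a UDF jar: PigServer ships it to the distributed cache.
        pig.registerJar("webhdfs://gateway-host:14000/user/remy/my-udfs.jar");

        // Load and Store with explicit webhdfs:// paths work as described above.
        pig.registerQuery("a = LOAD 'webhdfs://gateway-host:14000/data/in' USING PigStorage(',');");
        pig.registerQuery("STORE a INTO 'webhdfs://gateway-host:14000/data/out';");

        pig.shutdown();
    }
}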

Am I right in my understanding of what happens in that case?
Has anyone run into this issue already? Any solution or workaround?

Thanks a lot in advance,

Rémy.

Re: Hadoop and HttpFs

Posted by Harsh J <ha...@cloudera.com>.
You could consider using Oozie (via Hue or the direct CLI). Users with
HttpFS access can define and upload job descriptors to HDFS, and then
use Oozie to trigger the jobs inside the cluster (the job descriptors can
use regular hdfs:// URLs, since the work runs within the cluster). With
this kind of setup, users need no cluster access beyond the HttpFS and
Oozie endpoints, and both of those can be set up with authentication etc.
for added security.
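
A rough sketch of that flow using the Oozie Java client, with placeholder hostnames and paths (the workflow app itself would have been uploaded to HDFS beforehand, e.g. through HttpFS):

import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class TriggerWorkflow {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie endpoint; only this URL (plus HttpFS) needs to be
        // reachable from the client machine.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // The workflow definition (e.g. a Pig action) already sits on HDFS and
        // can freely use hdfs:// URLs, since it runs inside the cluster.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/remy/pig-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        String jobId = oozie.run(conf);   // submit and start the workflow
        System.out.println("Started workflow " + jobId);
    }
}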

On Fri, Apr 3, 2015 at 6:26 PM, Remy Dubois <rd...@talend.com> wrote:
> Hi everyone,
>
>
>
> I used to think about the constraint that a Hadoop client has to know and to
> have access to each single datanode to be able to read/write from/to HDFS.
> What happens if there are strong security policies on top of our cluster ?
>
> I found the HttpFs (and webhdfs) that allows a client to talk to a single
> machine, in order to do what I’m looking for. Operations on HDFS work fine
> indeed.
>
>
>
> Then, I’ve tried to execute a Pig (with Pig 0.12 on top of Hadoop 2.3.0) job
> using the same way. And here, there is these FileContext and
> AbstractFileSystem classes that don’t allow any other FileSystem than hdfs
> and local. WebHdfs is then not accepted.
>
> It’s not a problem until you need to register a jar in your Pig application.
> Indeed, regarding the Load and the Store, prefixing their path with the
> webhdfs:// scheme works. But when you register a jar in the Pig application,
> the PigServer will reuse the initial configuration (the one with the
> hdfs://) in order to send the jars to the distributed cache. And at that
> point it fails because the client doesn’t have access to the datanodes.
>
>
>
> Am I right in my understanding of what happens in that case ?
>
> Also, anyone meets this issue already? Any solution? Workaround?
>
>
>
> Thanks a lot in advance,
>
>
>
> Rémy.



-- 
Harsh J

RE: Hadoop and HttpFs

Posted by Remy Dubois <rd...@talend.com>.
Indeed, it could.

But I would prefer not to submit my jobs through WebHCat; using HttpFS and WebHDFS looked like it should be enough.
Any other ideas?

Thanks in advance,


Remy DUBOIS
Senior Hadoop Developer
+33 1 46 25 06 00 | +33 6 11 73 56 72
rdubois@talend.com | www.talend.com<https://www.talend.com>
Talend SA | 9, rue Pages - 92150 Suresnes - France
From: Peyman Mohajerian [mailto:mohajeri@gmail.com]
Sent: Friday, April 3, 2015 23:01
To: user@hadoop.apache.org
Subject: Re: Hadoop and HttpFs

Maybe this helps:

https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Pig



On Fri, Apr 3, 2015 at 5:56 AM, Remy Dubois <rd...@talend.com> wrote:
Hi everyone,

I used to think about the constraint that a Hadoop client has to know and to have access to each single datanode to be able to read/write from/to HDFS. What happens if there are strong security policies on top of our cluster ?
I found the HttpFs (and webhdfs) that allows a client to talk to a single machine, in order to do what I’m looking for. Operations on HDFS work fine indeed.

Then, I’ve tried to execute a Pig (with Pig 0.12 on top of Hadoop 2.3.0) job using the same way. And here, there is these FileContext and AbstractFileSystem classes that don’t allow any other FileSystem than hdfs and local. WebHdfs is then not accepted.
It’s not a problem until you need to register a jar in your Pig application. Indeed, regarding the Load and the Store, prefixing their path with the webhdfs:// scheme works. But when you register a jar in the Pig application, the PigServer will reuse the initial configuration (the one with the hdfs://) in order to send the jars to the distributed cache. And at that point it fails because the client doesn’t have access to the datanodes.

Am I right in my understanding of what happens in that case ?
Also, anyone meets this issue already? Any solution? Workaround?

Thanks a lot in advance,

Rémy.


Re: Hadoop and HttpFs

Posted by Peyman Mohajerian <mo...@gmail.com>.
Maybe this helps:

https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Pig
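
For what it's worth, a minimal sketch of what a Pig submission through that WebHCat REST endpoint could look like; the hostname, user name, and script path are placeholders, and the Pig script is assumed to already be on HDFS (uploaded e.g. via HttpFS):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public class SubmitPigViaWebHCat {
    public static void main(String[] args) throws Exception {
        // Placeholder WebHCat (Templeton) server; 50111 is its default port.
        URL url = new URL("http://webhcat-host:50111/templeton/v1/pig?user.name=remy");

        // 'file' points at a Pig script already stored on HDFS;
        // 'statusdir' is where WebHCat writes the launcher job's output.
        String body = "file=" + URLEncoder.encode("/user/remy/script.pig", "UTF-8")
                + "&statusdir=" + URLEncoder.encode("/user/remy/pig-status", "UTF-8");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }

        // The response is a small JSON document containing the launched job id.
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            System.out.println(in.useDelimiter("\\A").next());
        }
    }
}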



On Fri, Apr 3, 2015 at 5:56 AM, Remy Dubois <rd...@talend.com> wrote:

>  Hi everyone,
>
>
>
> I used to think about the constraint that a Hadoop client has to know and
> to have access to each single datanode to be able to read/write from/to
> HDFS. What happens if there are strong security policies on top of our
> cluster ?
>
> I found the HttpFs (and webhdfs) that allows a client to talk to a single
> machine, in order to do what I’m looking for. Operations on HDFS work fine
> indeed.
>
>
>
> Then, I’ve tried to execute a Pig (with Pig 0.12 on top of Hadoop 2.3.0)
> job using the same way. And here, there is these FileContext and
> AbstractFileSystem classes that don’t allow any other FileSystem than hdfs
> and local. WebHdfs is then not accepted.
>
> It’s not a problem until you need to register a jar in your Pig
> application. Indeed, regarding the Load and the Store, prefixing their path
> with the webhdfs:// scheme works. But when you register a jar in the Pig
> application, the PigServer will reuse the initial configuration (the one
> with the hdfs://) in order to send the jars to the distributed cache. And
> at that point it fails because the client doesn’t have access to the
> datanodes.
>
>
>
> Am I right in my understanding of what happens in that case ?
>
> Also, anyone meets this issue already? Any solution? Workaround?
>
>
>
> Thanks a lot in advance,
>
>
>
> Rémy.
>
