Posted to user@orc.apache.org by Srinivas M <sm...@gmail.com> on 2017/03/06 12:05:14 UTC

Support for writing ORC file while connecting through knox

Hi

I have an application that uses the Hive ORC API to write an ORC file to
HDFS. I use the native FileSystem API and pass the WebHDFS URI
(webhdfs://host:port) to create a FileSystem object:

fs = FileSystem.get(hdfsuri, conf, _user);
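
A minimal sketch of that write path using the core ORC writer API
(org.apache.orc). The host, schema, and target path below are made-up
placeholders, not values from this thread:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class WriteOrcOverWebHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Mirrors the snippet above; shown only to mark where the webhdfs URI
    // enters (namenode-host:50070 is a placeholder).
    FileSystem fs = FileSystem.get(new URI("webhdfs://namenode-host:50070"), conf, "user");
    System.out.println("resolved scheme: " + fs.getUri().getScheme());

    // Hypothetical one-column schema, just to make the example complete.
    TypeDescription schema = TypeDescription.fromString("struct<id:bigint>");

    // The writer resolves the FileSystem from the path's scheme and the conf,
    // which is why going through Knox needs a FileSystem-level integration.
    Writer writer = OrcFile.createWriter(
        new Path("webhdfs://namenode-host:50070/tmp/example.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));

    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector id = (LongColumnVector) batch.cols[0];
    for (int r = 0; r < 10; ++r) {
      id.vector[batch.size++] = r;
    }
    writer.addRowBatch(batch);
    writer.close();
  }
}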

When connecting through the Knox gateway, is there a way to still use the
native FileSystem API, or do I have to use REST API calls to access the
files on HDFS?

If REST calls are the only option, is there any way to read or write an ORC
file, given that the ORC readers and writers need an object of type
"org.apache.hadoop.fs.FileSystem"?

-- 
Srinivas
(*-*)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
You have to grow from the inside out. None can teach you, none can make you
spiritual.
                      -Narendra Nath Dutta(Swamy Vivekananda)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Re: Support for writing ORC file while connecting through knox

Posted by Owen O'Malley <om...@apache.org>.
If the Knox team implements the Hadoop FileSystem API, the ORC reader and
writer could use it automatically.

.. Owen
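
To illustrate what "automatically" means here: ORC resolves the FileSystem
from the Path's scheme through the Hadoop Configuration, so a Knox-backed
FileSystem would only need to be registered under its scheme. The knoxfs
scheme and the KnoxFileSystem class below are hypothetical placeholders,
not an existing Knox API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class HypotheticalKnoxFsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Hypothetical registration: map the "knoxfs" scheme to a Knox-backed
    // FileSystem implementation (no such class exists today).
    conf.set("fs.knoxfs.impl", "org.example.knox.KnoxFileSystem");

    // ORC calls path.getFileSystem(conf) internally, so the reader would pick
    // up the Knox-backed FileSystem without any change to the ORC code.
    Reader reader = OrcFile.createReader(
        new Path("knoxfs://gateway-host:8443/tmp/example.orc"),
        OrcFile.readerOptions(conf));
    System.out.println("rows: " + reader.getNumberOfRows());
  }
}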

On Mon, Mar 6, 2017 at 10:17 PM, Srinivas M <sm...@gmail.com> wrote:

> Thanks Owen and Larry for your perspective on this. This information is
> helpful. I shall explore alternatives to meet the requirements of my use
> case for now.
>
> On a side note, it was mentioned that there are plans (or that it is being
> considered) to add a KnoxFS. I have a question on that: as and when such a
> client / API is made available, would the ORC implementation also have to
> be enhanced to support KnoxFS in the ORC API, or would that come in by
> default? Or is it too early to discuss that?
>
> On Tue, Mar 7, 2017 at 12:20 AM, larry mccay <lm...@apache.org> wrote:
>
>> Thanks for adding the Knox list to this conversation, Owen!
>>
>> This is an interesting topic and one that we should define an end-to-end
>> use case for.
>>
>> We have considered a number of things to address this at one time or
>> another and have encountered one or more roadblocks on some of them:
>>
>> * A Knox (or proxy) FileSystem implementation that would accommodate the
>> additional context needed to route requests through a proxy such as Knox,
>> altering the default URLs to match what Knox expects. There was a POC of
>> this done a while back and we can try to dust that off.
>> * Knox did have a feature for configuring the "default topology", which
>> would allow the URLs normally used with WebHDFS directly to keep working,
>> with Knox translating the interactions into the context of the configured
>> default URLs. This feature is currently not working, unfortunately, and
>> we have a JIRA filed to correct that.
>> * There may be work needed in the Java WebHDFS client to accommodate
>> SPNEGO on the redirected DataNode interactions. Currently, the DataNode
>> doesn't expect the hadoop.auth cookie but a block access token instead (I
>> believe). So, when the block access token is presented to a Knox instance
>> that is configured to use the Hadoop Auth provider, Knox doesn't find a
>> hadoop.auth cookie and challenges the client again. Existing clients
>> don't expect this and throw an exception. Investigation is needed here to
>> find the most efficient way to address this.
>>
>> Incidentally, you may also consider looking at the KnoxShell client
>> classes to write a file to HDFS.
>>
>> http://knox.apache.org/books/knox-0-11-0/user-guide.html#Client+Details
>>
>> The example below shows how to use the Groovy-based DSL to establish a
>> "session" and to delete, write, and read files on HDFS. The underlying
>> Java classes can also be used directly, as an SDK, to do the same.
>>
>> Adopting the gateway-shell module is as simple as adding a Maven
>> dependency for that module to your project.
>> Additionally, the 0.12.0 release, which is currently undergoing a release
>> VOTE, contains a separate client artifact for download.
>>
>> import org.apache.hadoop.gateway.shell.Hadoop
>> import org.apache.hadoop.gateway.shell.hdfs.Hdfs
>> import groovy.json.JsonSlurper
>>
>> gateway = "https://localhost:8443/gateway/sandbox"
>> username = "guest"
>> password = "guest-password"
>> dataFile = "README"
>>
>> session = Hadoop.login( gateway, username, password )
>> Hdfs.rm( session ).file( "/tmp/example" ).recursive().now()
>> Hdfs.put( session ).file( dataFile ).to( "/tmp/example/README" ).now()
>> text = Hdfs.ls( session ).dir( "/tmp/example" ).now().string
>> json = (new JsonSlurper()).parseText( text )
>> println json.FileStatuses.FileStatus.pathSuffix
>> session.shutdown()
>> exit
>>
>>
>> On Mon, Mar 6, 2017 at 11:45 AM, Owen O'Malley <om...@apache.org>
>> wrote:
>>
>>> Unfortunately, in the short run, you'll need to copy them locally using
>>> wget or curl and then read the ORC file using file:/// paths to use the
>>> local file system.
>>>
>>> I talked with Larry McCay from the Knox project and he said that they are
>>> considering making a KnoxFS Java client, which implements
>>> org.apache.hadoop.fs.FileSystem, that would handle this use case.
>>>
>>> .. Owen
>>>
>>> On Mon, Mar 6, 2017 at 4:05 AM, Srinivas M <sm...@gmail.com> wrote:
>>>
>>> > Hi
>>> >
>>> > I have an application that uses the Hive ORC API to write an ORC file
>>> > to HDFS. I use the native FileSystem API and pass the WebHDFS URI
>>> > (webhdfs://host:port) to create a FileSystem object:
>>> >
>>> > fs = FileSystem.get(hdfsuri, conf, _user);
>>> >
>>> > When connecting through the Knox gateway, is there a way to still use
>>> > the native FileSystem API, or do I have to use REST API calls to
>>> > access the files on HDFS?
>>> >
>>> > If REST calls are the only option, is there any way to read or write
>>> > an ORC file, given that the ORC readers and writers need an object of
>>> > type "org.apache.hadoop.fs.FileSystem"?
>>> >
>>> > --
>>> > Srinivas
>>> > (*-*)
>>> > ------------------------------------------------------------
>>> > ------------------------------------------------------------
>>> > ------------------------------------------------------------------
>>> > You have to grow from the inside out. None can teach you, none can make
>>> > you spiritual.
>>> >                       -Narendra Nath Dutta(Swamy Vivekananda)
>>> > ------------------------------------------------------------
>>> > ------------------------------------------------------------
>>> > ------------------------------------------------------------------
>>> >
>>>
>>
>>
>
>
> --
> Srinivas
> (*-*)
> ------------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------------
> You have to grow from the inside out. None can teach you, none can make
> you spiritual.
>                       -Narendra Nath Dutta(Swamy Vivekananda)
> ------------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------------
>

Re: Support for writing ORC file while connecting through knox

Posted by Srinivas M <sm...@gmail.com>.
Thanks Owen and Larry for your perspective on this. This information is
helpful. I shall explore alternatives to meet the requirements of my use
case for now.

On a side note, it was mentioned that there are plans (or that it is being
considered) to add a KnoxFS. I have a question on that: as and when such a
client / API is made available, would the ORC implementation also have to
be enhanced to support KnoxFS in the ORC API, or would that come in by
default? Or is it too early to discuss that?

On Tue, Mar 7, 2017 at 12:20 AM, larry mccay <lm...@apache.org> wrote:

> Thanks for adding the Knox list to this conversation, Owen!
>
> This is an interesting topic and one that we should define an end-to-end
> use case for.
>
> We have considered a number of things to address this at one time or
> another and have encountered one or more roadblocks on some of them:
>
> * A Knox (or proxy) FileSystem implementation that would accommodate the
> additional context needed to route requests through a proxy such as Knox,
> altering the default URLs to match what Knox expects. There was a POC of
> this done a while back and we can try to dust that off.
> * Knox did have a feature for configuring the "default topology", which
> would allow the URLs normally used with WebHDFS directly to keep working,
> with Knox translating the interactions into the context of the configured
> default URLs. This feature is currently not working, unfortunately, and we
> have a JIRA filed to correct that.
> * There may be work needed in the Java WebHDFS client to accommodate
> SPNEGO on the redirected DataNode interactions. Currently, the DataNode
> doesn't expect the hadoop.auth cookie but a block access token instead (I
> believe). So, when the block access token is presented to a Knox instance
> that is configured to use the Hadoop Auth provider, Knox doesn't find a
> hadoop.auth cookie and challenges the client again. Existing clients don't
> expect this and throw an exception. Investigation is needed here to find
> the most efficient way to address this.
>
> Incidentally, you may also consider looking at the KnoxShell client
> classes to write a file to HDFS.
>
> http://knox.apache.org/books/knox-0-11-0/user-guide.html#Client+Details
>
> The example below shows how to use the Groovy-based DSL to establish a
> "session" and to delete, write, and read files on HDFS. The underlying
> Java classes can also be used directly, as an SDK, to do the same.
>
> Adopting the gateway-shell module is as simple as adding a Maven
> dependency for that module to your project.
> Additionally, the 0.12.0 release, which is currently undergoing a release
> VOTE, contains a separate client artifact for download.
>
> import org.apache.hadoop.gateway.shell.Hadoop
> import org.apache.hadoop.gateway.shell.hdfs.Hdfs
> import groovy.json.JsonSlurper
>
> gateway = "https://localhost:8443/gateway/sandbox"
> username = "guest"
> password = "guest-password"
> dataFile = "README"
>
> session = Hadoop.login( gateway, username, password )
> Hdfs.rm( session ).file( "/tmp/example" ).recursive().now()
> Hdfs.put( session ).file( dataFile ).to( "/tmp/example/README" ).now()
> text = Hdfs.ls( session ).dir( "/tmp/example" ).now().string
> json = (new JsonSlurper()).parseText( text )
> println json.FileStatuses.FileStatus.pathSuffix
> session.shutdown()
> exit
>
>
> On Mon, Mar 6, 2017 at 11:45 AM, Owen O'Malley <om...@apache.org> wrote:
>
>> Unfortunately, in the short run, you'll need to copy them locally using
>> wget or curl and then read the ORC file using file:/// paths to use the
>> local file system.
>>
>> I talked with Larry McCay from the Knox project and he said that they are
>> considering making a KnoxFS Java client, which implements
>> org.apache.hadoop.fs.FileSystem, that would handle this use case.
>>
>> .. Owen
>>
>> On Mon, Mar 6, 2017 at 4:05 AM, Srinivas M <sm...@gmail.com> wrote:
>>
>> > Hi
>> >
>> > I have an application that uses the Hive ORC API to write an ORC file
>> > to HDFS. I use the native FileSystem API and pass the WebHDFS URI
>> > (webhdfs://host:port) to create a FileSystem object:
>> >
>> > fs = FileSystem.get(hdfsuri, conf, _user);
>> >
>> > When connecting through the Knox gateway, is there a way to still use
>> > the native FileSystem API, or do I have to use REST API calls to access
>> > the files on HDFS?
>> >
>> > If REST calls are the only option, is there any way to read or write an
>> > ORC file, given that the ORC readers and writers need an object of type
>> > "org.apache.hadoop.fs.FileSystem"?
>> >
>> > --
>> > Srinivas
>> > (*-*)
>> > ------------------------------------------------------------
>> > ------------------------------------------------------------
>> > ------------------------------------------------------------------
>> > You have to grow from the inside out. None can teach you, none can make
>> > you spiritual.
>> >                       -Narendra Nath Dutta(Swamy Vivekananda)
>> > ------------------------------------------------------------
>> > ------------------------------------------------------------
>> > ------------------------------------------------------------------
>> >
>>
>
>


-- 
Srinivas
(*-*)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
You have to grow from the inside out. None can teach you, none can make you
spiritual.
                      -Narendra Nath Dutta(Swamy Vivekananda)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Re: Support for writing ORC file while connecting through knox

Posted by larry mccay <lm...@apache.org>.
Thanks for adding the Knox list to this conversation, Owen!

This is an interesting topic and one that we should define an end-to-end
use case for.

We have considered a number of things to address this at one time or
another and have encountered one or more roadblocks on some of them:

* A Knox (or proxy) FileSystem implementation that would accommodate the
additional context needed to route requests through a proxy such as Knox,
altering the default URLs to match what Knox expects. There was a POC of
this done a while back and we can try to dust that off.
* Knox did have a feature for configuring the "default topology", which
would allow the URLs normally used with WebHDFS directly to keep working,
with Knox translating the interactions into the context of the configured
default URLs. This feature is currently not working, unfortunately, and we
have a JIRA filed to correct that.
* There may be work needed in the Java WebHDFS client to accommodate SPNEGO
on the redirected DataNode interactions. Currently, the DataNode doesn't
expect the hadoop.auth cookie but a block access token instead (I believe).
So, when the block access token is presented to a Knox instance that is
configured to use the Hadoop Auth provider, Knox doesn't find a hadoop.auth
cookie and challenges the client again. Existing clients don't expect this
and throw an exception. Investigation is needed here to find the most
efficient way to address this.

Incidentally, you may also consider looking at the KnoxShell client classes
to write a file to HDFS.

http://knox.apache.org/books/knox-0-11-0/user-guide.html#Client+Details

The example below shows how to use the Groovy-based DSL to establish a
"session" and to delete, write, and read files on HDFS. The underlying
Java classes can also be used directly, as an SDK, to do the same (a Java
sketch follows the Groovy example below).

Adopting the gateway-shell module is as simple as adding a Maven dependency
for that module to your project.
Additionally, the 0.12.0 release, which is currently undergoing a release
VOTE, contains a separate client artifact for download.

import org.apache.hadoop.gateway.shell.Hadoop
import org.apache.hadoop.gateway.shell.hdfs.Hdfs
import groovy.json.JsonSlurper

gateway = "https://localhost:8443/gateway/sandbox"
username = "guest"
password = "guest-password"
dataFile = "README"

session = Hadoop.login( gateway, username, password )
Hdfs.rm( session ).file( "/tmp/example" ).recursive().now()
Hdfs.put( session ).file( dataFile ).to( "/tmp/example/README" ).now()
text = Hdfs.ls( session ).dir( "/tmp/example" ).now().string
json = (new JsonSlurper()).parseText( text )
println json.FileStatuses.FileStatus.pathSuffix
session.shutdown()
exit
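
For readers working in Java, here is a rough equivalent of the Groovy
session above, calling the same gateway-shell classes directly. The gateway
URL, credentials, and paths reuse the sample values from the Groovy example,
and getString() is inferred from the DSL's ".string" property, so treat this
as a sketch rather than verified code:

import org.apache.hadoop.gateway.shell.Hadoop;
import org.apache.hadoop.gateway.shell.hdfs.Hdfs;

public class KnoxShellHdfsExample {
  public static void main(String[] args) throws Exception {
    String gateway = "https://localhost:8443/gateway/sandbox";
    Hadoop session = Hadoop.login(gateway, "guest", "guest-password");
    try {
      // Same operations as the Groovy session: delete, upload, then list.
      Hdfs.rm(session).file("/tmp/example").recursive().now();
      Hdfs.put(session).file("README").to("/tmp/example/README").now();
      String listing = Hdfs.ls(session).dir("/tmp/example").now().getString();
      System.out.println(listing);
    } finally {
      session.shutdown();
    }
  }
}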


On Mon, Mar 6, 2017 at 11:45 AM, Owen O'Malley <om...@apache.org> wrote:

> Unfortunately, in the short run, you'll need to copy them locally using
> wget or curl and then read the ORC file using file:/// paths to use the
> local file system.
>
> I talked with Larry McCay from the Knox project and he said that they are
> considering making a KnoxFS Java client, which implements
> org.apache.hadoop.fs.FileSystem, that would handle this use case.
>
> .. Owen
>
> On Mon, Mar 6, 2017 at 4:05 AM, Srinivas M <sm...@gmail.com> wrote:
>
> > Hi
> >
> > I have an application that uses the Hive ORC API to write an ORC file
> > to HDFS. I use the native FileSystem API and pass the WebHDFS URI
> > (webhdfs://host:port) to create a FileSystem object:
> >
> > fs = FileSystem.get(hdfsuri, conf, _user);
> >
> > When connecting through the Knox gateway, is there a way to still use
> > the native FileSystem API, or do I have to use REST API calls to access
> > the files on HDFS?
> >
> > If REST calls are the only option, is there any way to read or write an
> > ORC file, given that the ORC readers and writers need an object of type
> > "org.apache.hadoop.fs.FileSystem"?
> >
> > --
> > Srinivas
> > (*-*)
> > ------------------------------------------------------------
> > ------------------------------------------------------------
> > ------------------------------------------------------------------
> > You have to grow from the inside out. None can teach you, none can make
> > you spiritual.
> >                       -Narendra Nath Dutta(Swamy Vivekananda)
> > ------------------------------------------------------------
> > ------------------------------------------------------------
> > ------------------------------------------------------------------
> >
>

Re: Support for writing ORC file while connecting through knox

Posted by Owen O'Malley <om...@apache.org>.
Unfortunately, in the short run, you'll need to copy them locally using
wget or curl and then read the ORC file using file:/// paths to use the
local file system.
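
A sketch of that workaround: fetch the file through the Knox WebHDFS
endpoint, then read it with a file:/// path. The gateway URL, topology,
credentials, and file paths below reuse the sample values from elsewhere in
this thread and will differ per deployment:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class ReadLocallyCopiedOrc {
  public static void main(String[] args) throws Exception {
    // First copy the file down through Knox, for example:
    //   curl -k -u guest:guest-password -L -o /tmp/example.orc \
    //     "https://localhost:8443/gateway/sandbox/webhdfs/v1/tmp/example.orc?op=OPEN"

    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("file:///tmp/example.orc"),
        OrcFile.readerOptions(conf));

    // Count the rows batch by batch to show the read path end to end.
    RecordReader rows = reader.rows();
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    long count = 0;
    while (rows.nextBatch(batch)) {
      count += batch.size;
    }
    rows.close();
    System.out.println("rows read: " + count);
  }
}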

I talked with Larry McCay from the Knox project and he said that they are
considering making a KnoxFS Java client, which implements
org.apache.hadoop.fs.FileSystem, that would handle this use case.

.. Owen

On Mon, Mar 6, 2017 at 4:05 AM, Srinivas M <sm...@gmail.com> wrote:

> Hi
>
> I have an application that uses the Hive ORC API to write an ORC file
> to HDFS. I use the native FileSystem API and pass the WebHDFS URI
> (webhdfs://host:port) to create a FileSystem object:
>
> fs = FileSystem.get(hdfsuri, conf, _user);
>
> When connecting through the Knox gateway, is there a way to still use
> the native FileSystem API, or do I have to use REST API calls to access
> the files on HDFS?
>
> If REST calls are the only option, is there any way to read or write an
> ORC file, given that the ORC readers and writers need an object of type
> "org.apache.hadoop.fs.FileSystem"?
>
> --
> Srinivas
> (*-*)
> ------------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------------
> You have to grow from the inside out. None can teach you, none can make
> you spiritual.
>                       -Narendra Nath Dutta(Swamy Vivekananda)
> ------------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------------
>
