Posted to hdfs-user@hadoop.apache.org by Julian Bui <ju...@gmail.com> on 2013/03/17 10:39:02 UTC

executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Hi hadoop users,

I just want to verify that there is no way to put a binary on HDFS and
execute it using the Hadoop Java API.  If it is indeed not possible, I
would appreciate advice on creating an implementation that uses native
libraries.

"In contrast to the POSIX model, there are no *sticky*, *setuid* or
*setgid* bits
for files as there is no notion of executable files."  Is there no
workaround?

A little bit more about what I'm trying to do: I have a binary that
converts my images to another image format.  I want to put it in the
distributed cache and tell the reducer to execute the binary on the
data on HDFS.  However, since I can't set the execute permission bit on
that file, it seems that I cannot do that.

Since I cannot use the binary, it seems like I have to write my own
implementation to do this.  The challenge is that the libraries I can
use to do this are .a and .so files.  Would I have to use JNI, package
the libraries in the distributed cache, and then have the reducer find
and use those libraries on the task nodes?  Actually, I wouldn't want
to use JNI; I'd probably want to use Java Native Access (JNA) to do
this.  Has anyone used JNA with Hadoop and been successful?  Are there
problems I'll encounter?

Please let me know.

Thanks,
-Julian
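
A minimal Java sketch of the JNA approach described in the question,
with all names hypothetical: a native library libconverter.so exposing
int convert_image(const char *in, const char *out), assumed to have
been shipped to each task's working directory via the distributed
cache.  JNA only needs read access to the .so on the local disk, so the
missing execute bit on HDFS is not a problem for this route:

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    public class NativeConverter {
        // Hypothetical JNA binding for a native image-conversion library.
        public interface ConverterLib extends Library {
            // Assumed C signature: int convert_image(const char *in, const char *out)
            int convert_image(String inputPath, String outputPath);
        }

        public static int convert(String in, String out) {
            // Look for libconverter.so in the task's working directory,
            // where the distributed cache places (or symlinks) shipped files.
            System.setProperty("jna.library.path", ".");
            ConverterLib lib =
                    (ConverterLib) Native.loadLibrary("converter", ConverterLib.class);
            return lib.convert_image(in, out);
        }
    }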

Re: executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Posted by Harsh J <ha...@cloudera.com>.
Yes, that's correct: data locality could get lost with that approach.
If you need locality and also need a specialized input format, then
it's probably better to go the Java + JNI route completely.
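
For that all-Java route, a sketch along these lines could work,
assuming the hypothetical NativeConverter binding shown after the
first message: the task copies the image out of HDFS with the normal
FileSystem API, converts it in-process through JNI/JNA, and writes the
result back (file names here are placeholders).

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalizeAndConvert {
        // Copy one HDFS file to the task's local scratch space, convert it
        // with the (hypothetical) native binding, and upload the result.
        public static void convertOne(Configuration conf, String src, String dst)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            fs.copyToLocalFile(new Path(src), new Path("input.img"));
            NativeConverter.convert("input.img", "output.img");
            fs.copyFromLocalFile(new Path("output.img"), new Path(dst));
        }
    }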

On Mon, Mar 18, 2013 at 2:05 AM, Julian Bui <ju...@gmail.com> wrote:
> Ah, thanks for clarifying, Harsh,
>
> One thing that concerns me is that you wrote "input can be mere HDFS
> file path strings", which implies that the tasks will not be guaranteed
> to run co-located with the data. Is that correct?
>
> -Julian
>
>
> On Sun, Mar 17, 2013 at 6:28 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi,
>>
>> Yes, streaming lets you send in arbitrary programs, such as a shell
>> script or a Python script, execute them on the input, and collect the
>> output (it's agnostic to what you send, as long as you also send the
>> full launch environment and command instructions). The same mechanism
>> can be leveraged to have your MR environment start a number of map
>> tasks whose input is mere HDFS file path strings (but not the file
>> data); your program then uses libhdfs (shipped along or installed on
>> all nodes) to read those paths and process them in any way you see
>> fit. Essentially, you can tweak streaming's functionality to do what
>> you need and to achieve different forms of parallelism.
>>
>> Yes, what I am suggesting is similar to Hadoop Pipes, but Pipes has
>> largely fallen out of use over the past few major releases, and
>> Streaming is generally recommended instead.
>>
>> On Sun, Mar 17, 2013 at 4:20 PM, Julian Bui <ju...@gmail.com> wrote:
>> > Hello Harsh,
>> >
>> > Thanks for the reply.  I just want to verify that I understand your
>> > comments.
>> >
>> > It sounds like you're saying I should write a c/c++ application and get
>> > access to hdfs using libhdfs.  What I'm a little confused about is what
>> > you
>> > mean by "use a streaming program".  Do you mean I should use the hadoop
>> > streaming interface to call some native binary that I wrote?  I was not
>> > even
>> > aware that the streaming interface could execute native binaries.  I
>> > thought
>> > that anything using the hadoop streaming interface only interacts with
>> > stdin
>> > and stdout and cannot make modifications to the hdfs.  Or did you mean
>> > that
>> > I should use hadoop pipes to write a c/c++ application?
>> >
>> > Anyway, I hope that you can help me clear things up in my head.
>> >
>> > Thanks,
>> > -Julian
>> >
>> > On Sun, Mar 17, 2013 at 2:50 AM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> You're confusing two things here. HDFS is a data storage filesystem.
>> >> MR does not have anything to do with HDFS (generally speaking).
>> >>
>> >> A reducer runs as a regular JVM on a provided node, and can execute
>> >> any program you'd like it to by downloading it onto its configured
>> >> local filesystem and executing it.
>> >>
>> >> If your goal is merely to run a regular program over data that is
>> >> sitting in HDFS, that can be achieved. If your library is in C then
>> >> simply use a streaming program to run it and use libhdfs' HDFS API
>> >> (C/C++) to read data into your functions from HDFS files. Would this
>> >> not suffice?
>> >>
>> >> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <ju...@gmail.com>
>> >> wrote:
>> >> > Hi hadoop users,
>> >> >
>> >> > I just want to verify that there is no way to put a binary on HDFS
>> >> > and
>> >> > execute it using the hadoop java api.  If not, I would appreciate
>> >> > advice on creating an implementation that uses native libraries.
>> >> >
>> >> > "In contrast to the POSIX model, there are no sticky, setuid or
>> >> > setgid
>> >> > bits
>> >> > for files as there is no notion of executable files."  Is there no
>> >> > workaround?
>> >> >
>> >> > A little bit more about what I'm trying to do.  I have a binary that
>> >> > converts my image to another image format.  I currently want to put
>> >> > it
>> >> > in
>> >> > the distributed cache and tell the reducer to execute the binary on
>> >> > the
>> >> > data
>> >> > on hdfs.  However, since I can't set the execute permission bit on
>> >> > that
>> >> > file, it seems that I cannot do that.
>> >> >
>> >> > Since I cannot use the binary, it seems like I have to use my own
>> >> > implementation to do this.  The challenge is that these libraries
>> >> > that I
>> >> > can
>> >> > use to do this are .a and .so files.  Would I have to use JNI and
>> >> > package
>> >> > the libraries in the distributed cache and then have the reducer find
>> >> > and
>> >> > use those libraries on the task nodes?  Actually, I wouldn't want to
>> >> > use
>> >> > JNI, I'd probably want to use java native access (JNA) to do this.
>> >> > Has
>> >> > anyone used JNA with hadoop and been successful?  Are there problems
>> >> > I'll
>> >> > encounter?
>> >> >
>> >> > Please let me know.
>> >> >
>> >> > Thanks,
>> >> > -Julian
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



--
Harsh J

Re: executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Posted by Jens Scheidtmann <je...@gmail.com>.
Hi Julian,

Replying to your original question.


2013/3/17 Julian Bui <ju...@gmail.com>

> Ah, thanks for clarifying, Harsh,
>
> >> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <ju...@gmail.com>
> wrote:
>
>> >> > Hi hadoop users,
>> >> >
>> >> > I just want to verify that there is no way to put a binary on HDFS
>> and
>> >> > execute it using the hadoop java api.  If not, I would appreciate
>> >> > advice on creating an implementation that uses native libraries.
>>
>

There's support in the API for distributing native libraries or executables
and creating a symlink in the working directory of your MR processes.

This is called the distributed cache and an example can be found here:
http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms#How_to_submit_debug_script
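
A sketch of the job-setup side, using the
org.apache.hadoop.filecache.DistributedCache API current at the time;
the HDFS path is hypothetical, and the #libconverter.so fragment names
the symlink each task will see in its working directory:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class ShipNativeLib {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ShipNativeLib.class);
            // The library must already be on HDFS (path hypothetical).
            DistributedCache.addCacheFile(
                    new URI("/user/julian/libs/libconverter.so#libconverter.so"),
                    conf);
            DistributedCache.createSymlink(conf); // symlink into each task's cwd
            // ... configure mapper/reducer and input/output paths, then submit.
        }
    }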

Best regards,

Jens

Re: executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Posted by Julian Bui <ju...@gmail.com>.
Ah, thanks for clarifying, Harsh,

One thing that concerns me is that you wrote "input can be mere HDFS
file path strings", which implies that the tasks will not be guaranteed
to run co-located with the data. Is that correct?

-Julian

On Sun, Mar 17, 2013 at 6:28 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi,
>
> Yes, streaming lets you send in arbitrary programs, such as a shell
> script or a Python script, execute them on the input, and collect the
> output (it's agnostic to what you send, as long as you also send the
> full launch environment and command instructions). The same mechanism
> can be leveraged to have your MR environment start a number of map
> tasks whose input is mere HDFS file path strings (but not the file
> data); your program then uses libhdfs (shipped along or installed on
> all nodes) to read those paths and process them in any way you see
> fit. Essentially, you can tweak streaming's functionality to do what
> you need and to achieve different forms of parallelism.
>
> Yes, what I am suggesting is similar to Hadoop Pipes, but Pipes has
> largely fallen out of use over the past few major releases, and
> Streaming is generally recommended instead.
>
> On Sun, Mar 17, 2013 at 4:20 PM, Julian Bui <ju...@gmail.com> wrote:
> > Hello Harsh,
> >
> > Thanks for the reply.  I just want to verify that I understand your
> > comments.
> >
> > It sounds like you're saying I should write a c/c++ application and get
> > access to hdfs using libhdfs.  What I'm a little confused about is what
> you
> > mean by "use a streaming program".  Do you mean I should use the hadoop
> > streaming interface to call some native binary that I wrote?  I was not
> even
> > aware that the streaming interface could execute native binaries.  I
> thought
> > that anything using the hadoop streaming interface only interacts with
> stdin
> > and stdout and cannot make modifications to the hdfs.  Or did you mean
> that
> > I should use hadoop pipes to write a c/c++ application?
> >
> > Anyway, I hope that you can help me clear things up in my head.
> >
> > Thanks,
> > -Julian
> >
> > On Sun, Mar 17, 2013 at 2:50 AM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> You're confusing two things here. HDFS is a data storage filesystem.
> >> MR does not have anything to do with HDFS (generally speaking).
> >>
> >> A reducer runs as a regular JVM on a provided node, and can execute
> >> any program you'd like it to by downloading it onto its configured
> >> local filesystem and executing it.
> >>
> >> If your goal is merely to run a regular program over data that is
> >> sitting in HDFS, that can be achieved. If your library is in C then
> >> simply use a streaming program to run it and use libhdfs' HDFS API
> >> (C/C++) to read data into your functions from HDFS files. Would this
> >> not suffice?
> >>
> >> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <ju...@gmail.com>
> wrote:
> >> > Hi hadoop users,
> >> >
> >> > I just want to verify that there is no way to put a binary on HDFS and
> >> > execute it using the hadoop java api.  If not, I would appreciate
> >> > advice on creating an implementation that uses native libraries.
> >> >
> >> > "In contrast to the POSIX model, there are no sticky, setuid or setgid
> >> > bits
> >> > for files as there is no notion of executable files."  Is there no
> >> > workaround?
> >> >
> >> > A little bit more about what I'm trying to do.  I have a binary that
> >> > converts my image to another image format.  I currently want to put it
> >> > in
> >> > the distributed cache and tell the reducer to execute the binary on
> the
> >> > data
> >> > on hdfs.  However, since I can't set the execute permission bit on
> that
> >> > file, it seems that I cannot do that.
> >> >
> >> > Since I cannot use the binary, it seems like I have to use my own
> >> > implementation to do this.  The challenge is that these libraries
> that I
> >> > can
> >> > use to do this are .a and .so files.  Would I have to use JNI and
> >> > package
> >> > the libraries in the distributed cache and then have the reducer find
> >> > and
> >> > use those libraries on the task nodes?  Actually, I wouldn't want to
> use
> >> > JNI, I'd probably want to use java native access (JNA) to do this.
>  Has
> >> > anyone used JNA with hadoop and been successful?  Are there problems
> >> > I'll
> >> > encounter?
> >> >
> >> > Please let me know.
> >> >
> >> > Thanks,
> >> > -Julian
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>

Re: executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Yes, streaming lets you send in arbitrary programs, such as a shell
script or a Python script, execute them on the input, and collect the
output (it's agnostic to what you send, as long as you also send the
full launch environment and command instructions). The same mechanism
can be leveraged to have your MR environment start a number of map
tasks whose input is mere HDFS file path strings (but not the file
data); your program then uses libhdfs (shipped along or installed on
all nodes) to read those paths and process them in any way you see
fit. Essentially, you can tweak streaming's functionality to do what
you need and to achieve different forms of parallelism.

Yes, what I am suggesting is similar to Hadoop Pipes, but Pipes has
largely fallen out of use over the past few major releases, and
Streaming is generally recommended instead.
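
To make the path-strings idea concrete in Java terms (not Harsh's
exact streaming-plus-libhdfs proposal, just the same pattern), a
sketch of a mapper whose input records are HDFS path strings, e.g. fed
from a text file of paths via NLineInputFormat; because the input is a
path rather than the file's blocks, the scheduler has nothing to
co-locate the task with:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PathStringMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // 'value' is an HDFS path string, not file data, so the
            // framework cannot schedule this task near the file's blocks.
            Path file = new Path(value.toString());
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            FSDataInputStream in = fs.open(file);
            try {
                // ... read and process the bytes here, e.g. hand them to a
                // native converter via JNI/JNA ...
            } finally {
                in.close();
            }
            context.write(value, new Text("done")); // hypothetical status record
        }
    }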

On Sun, Mar 17, 2013 at 4:20 PM, Julian Bui <ju...@gmail.com> wrote:
> Hello Harsh,
>
> Thanks for the reply.  I just want to verify that I understand your
> comments.
>
> It sounds like you're saying I should write a c/c++ application and get
> access to hdfs using libhdfs.  What I'm a little confused about is what you
> mean by "use a streaming program".  Do you mean I should use the hadoop
> streaming interface to call some native binary that I wrote?  I was not even
> aware that the streaming interface could execute native binaries.  I thought
> that anything using the hadoop streaming interface only interacts with stdin
> and stdout and cannot make modifications to the hdfs.  Or did you mean that
> I should use hadoop pipes to write a c/c++ application?
>
> Anyway, I hope that you can help me clear things up in my head.
>
> Thanks,
> -Julian
>
> On Sun, Mar 17, 2013 at 2:50 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>> You're confusing two things here. HDFS is a data storage filesystem.
>> MR does not have anything to do with HDFS (generally speaking).
>>
>> A reducer runs as a regular JVM on a provided node, and can execute
>> any program you'd like it to by downloading it onto its configured
>> local filesystem and executing it.
>>
>> If your goal is merely to run a regular program over data that is
>> sitting in HDFS, that can be achieved. If your library is in C then
>> simply use a streaming program to run it and use libhdfs' HDFS API
>> (C/C++) to read data into your functions from HDFS files. Would this
>> not suffice?
>>
>> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <ju...@gmail.com> wrote:
>> > Hi hadoop users,
>> >
>> > I just want to verify that there is no way to put a binary on HDFS and
>> > execute it using the hadoop java api.  If not, I would appreciate advice
>> > on creating an implementation that uses native libraries.
>> >
>> > "In contrast to the POSIX model, there are no sticky, setuid or setgid
>> > bits
>> > for files as there is no notion of executable files."  Is there no
>> > workaround?
>> >
>> > A little bit more about what I'm trying to do.  I have a binary that
>> > converts my image to another image format.  I currently want to put it
>> > in
>> > the distributed cache and tell the reducer to execute the binary on the
>> > data
>> > on hdfs.  However, since I can't set the execute permission bit on that
>> > file, it seems that I cannot do that.
>> >
>> > Since I cannot use the binary, it seems like I have to use my own
>> > implementation to do this.  The challenge is that these libraries that I
>> > can
>> > use to do this are .a and .so files.  Would I have to use JNI and
>> > package
>> > the libraries in the distributed cache and then have the reducer find
>> > and
>> > use those libraries on the task nodes?  Actually, I wouldn't want to use
>> > JNI, I'd probably want to use java native access (JNA) to do this.  Has
>> > anyone used JNA with hadoop and been successful?  Are there problems
>> > I'll
>> > encounter?
>> >
>> > Please let me know.
>> >
>> > Thanks,
>> > -Julian
>>
>>
>>
>> --
>> Harsh J
>
>



--
Harsh J

Re: executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Yes, streaming lets you send in arbitrary programs and execute them
upon the input and helps collect the output, such as a shell script or
a python script (its agnostic to what you send as long as you also
send full launching environment and command instructions). The same
can be leveraged to have your MR environment start a number of map
tasks whose input can be mere HDFS file path strings (but not the file
data) and that your program leverages libhdfs (shipped along or
installed on all nodes) to read those paths and process them in a way
you see fit. Essentially, you can tweak the streaming functionalities
to make it do what you need it/achieve different ways of
parallelism/etc..

Yes, what am suggesting is perhaps similar to Hadoop Pipes, but that's
generally been out of major use in the past few major releases we've
made and generally Streaming is recommended in its favor.

On Sun, Mar 17, 2013 at 4:20 PM, Julian Bui <ju...@gmail.com> wrote:
> Hello Harsh,
>
> Thanks for the reply.  I just want to verify that I understand your
> comments.
>
> It sounds like you're saying I should write a c/c++ application and get
> access to hdfs using libhdfs.  What I'm a little confused about is what you
> mean by "use a streaming program".  Do you mean I should use the hadoop
> streaming interface to call some native binary that I wrote?  I was not even
> aware that the streaming interface could execute native binaries.  I thought
> that anything using the hadoop streaming interface only interacts with stdin
> and stdout and cannot make modifications to the hdfs.  Or did you mean that
> I should use hadoop pipes to write a c/c++ application?
>
> Anyway, I hope that you can help me clear things up in my head.
>
> Thanks,
> -Julian
>
> On Sun, Mar 17, 2013 at 2:50 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>> You're confusing two things here. HDFS is a data storage filesystem.
>> MR does not have anything to do with HDFS (generally speaking).
>>
>> A reducer runs as a regular JVM on a provided node, and can execute
>> any program you'd like it to by downloading it onto its configured
>> local filesystem and executing it.
>>
>> If your goal is merely to run a regular program over data that is
>> sitting in HDFS, that can be achieved. If your library is in C then
>> simply use a streaming program to run it and use libhdfs' HDFS API
>> (C/C++) to read data into your functions from HDFS files. Would this
>> not suffice?
>>
>> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <ju...@gmail.com> wrote:
>> > Hi hadoop users,
>> >
>> > I just want to verify that there is no way to put a binary on HDFS and
>> > execute it using the hadoop java api.  If not, I would appreciate advice
>> > in
>> > getting in creating an implementation that uses native libraries.
>> >
>> > "In contrast to the POSIX model, there are no sticky, setuid or setgid
>> > bits
>> > for files as there is no notion of executable files."  Is there no
>> > workaround?
>> >
>> > A little bit more about what I'm trying to do.  I have a binary that
>> > converts my image to another image format.  I currently want to put it
>> > in
>> > the distributed cache and tell the reducer to execute the binary on the
>> > data
>> > on hdfs.  However, since I can't set the execute permission bit on that
>> > file, it seems that I cannot do that.
>> >
>> > Since I cannot use the binary, it seems like I have to use my own
>> > implementation to do this.  The challenge is that these libraries that I
>> > can
>> > use to do this are .a and .so files.  Would I have to use JNI and
>> > package
>> > the libraries in the distributed cache and then have the reducer find
>> > and
>> > use those libraries on the task nodes?  Actually, I wouldn't want to use
>> > JNI, I'd probably want to use java native access (JNA) to do this.  Has
>> > anyone used JNA with hadoop and been successful?  Are there problems
>> > I'll
>> > encounter?
>> >
>> > Please let me know.
>> >
>> > Thanks,
>> > -Julian
>>
>>
>>
>> --
>> Harsh J
>
>



--
Harsh J

Re: executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Posted by Harsh J <ha...@cloudera.com>.
Hi,

Yes, streaming lets you send in arbitrary programs and execute them
upon the input and helps collect the output, such as a shell script or
a python script (its agnostic to what you send as long as you also
send full launching environment and command instructions). The same
can be leveraged to have your MR environment start a number of map
tasks whose input can be mere HDFS file path strings (but not the file
data) and that your program leverages libhdfs (shipped along or
installed on all nodes) to read those paths and process them in a way
you see fit. Essentially, you can tweak the streaming functionalities
to make it do what you need it/achieve different ways of
parallelism/etc..

Yes, what am suggesting is perhaps similar to Hadoop Pipes, but that's
generally been out of major use in the past few major releases we've
made and generally Streaming is recommended in its favor.

On Sun, Mar 17, 2013 at 4:20 PM, Julian Bui <ju...@gmail.com> wrote:
> Hello Harsh,
>
> Thanks for the reply.  I just want to verify that I understand your
> comments.
>
> It sounds like you're saying I should write a c/c++ application and get
> access to hdfs using libhdfs.  What I'm a little confused about is what you
> mean by "use a streaming program".  Do you mean I should use the hadoop
> streaming interface to call some native binary that I wrote?  I was not even
> aware that the streaming interface could execute native binaries.  I thought
> that anything using the hadoop streaming interface only interacts with stdin
> and stdout and cannot make modifications to the hdfs.  Or did you mean that
> I should use hadoop pipes to write a c/c++ application?
>
> Anyway, I hope that you can help me clear things up in my head.
>
> Thanks,
> -Julian
>
> On Sun, Mar 17, 2013 at 2:50 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>> You're confusing two things here. HDFS is a data storage filesystem.
>> MR does not have anything to do with HDFS (generally speaking).
>>
>> A reducer runs as a regular JVM on a provided node, and can execute
>> any program you'd like it to by downloading it onto its configured
>> local filesystem and executing it.
>>
>> If your goal is merely to run a regular program over data that is
>> sitting in HDFS, that can be achieved. If your library is in C then
>> simply use a streaming program to run it and use libhdfs' HDFS API
>> (C/C++) to read data into your functions from HDFS files. Would this
>> not suffice?
>>
>> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <ju...@gmail.com> wrote:
>> > Hi hadoop users,
>> >
>> > I just want to verify that there is no way to put a binary on HDFS and
>> > execute it using the hadoop java api.  If not, I would appreciate advice
>> > in
>> > getting in creating an implementation that uses native libraries.
>> >
>> > "In contrast to the POSIX model, there are no sticky, setuid or setgid
>> > bits
>> > for files as there is no notion of executable files."  Is there no
>> > workaround?
>> >
>> > A little bit more about what I'm trying to do.  I have a binary that
>> > converts my image to another image format.  I currently want to put it
>> > in
>> > the distributed cache and tell the reducer to execute the binary on the
>> > data
>> > on hdfs.  However, since I can't set the execute permission bit on that
>> > file, it seems that I cannot do that.
>> >
>> > Since I cannot use the binary, it seems like I have to use my own
>> > implementation to do this.  The challenge is that these libraries that I
>> > can
>> > use to do this are .a and .so files.  Would I have to use JNI and
>> > package
>> > the libraries in the distributed cache and then have the reducer find
>> > and
>> > use those libraries on the task nodes?  Actually, I wouldn't want to use
>> > JNI, I'd probably want to use java native access (JNA) to do this.  Has
>> > anyone used JNA with hadoop and been successful?  Are there problems
>> > I'll
>> > encounter?
>> >
>> > Please let me know.
>> >
>> > Thanks,
>> > -Julian
>>
>>
>>
>> --
>> Harsh J
>
>



--
Harsh J

Re: executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Posted by Julian Bui <ju...@gmail.com>.
Hello Harsh,

Thanks for the reply.  I just want to verify that I understand your
comments.

It sounds like you're saying I should write a C/C++ application and get
access to HDFS using libhdfs.  What I'm a little confused about is what
you mean by "use a streaming program".  Do you mean I should use the
Hadoop Streaming interface to call some native binary that I wrote?  I
was not even aware that the streaming interface could execute native
binaries.  I thought that anything using the Hadoop Streaming interface
only interacted with stdin and stdout and could not make modifications
to HDFS.  Or did you mean that I should use Hadoop Pipes to write a
C/C++ application?

Anyway, I hope that you can help me clear things up in my head.

Thanks,
-Julian

On Sun, Mar 17, 2013 at 2:50 AM, Harsh J <ha...@cloudera.com> wrote:

> You're confusing two things here. HDFS is a data storage filesystem.
> MR does not have anything to do with HDFS (generally speaking).
>
> A reducer runs as a regular JVM on a provided node, and can execute
> any program you'd like it to by downloading it onto its configured
> local filesystem and executing it.
>
> If your goal is merely to run a regular program over data that is
> sitting in HDFS, that can be achieved. If your library is in C then
> simply use a streaming program to run it and use libhdfs' HDFS API
> (C/C++) to read data into your functions from HDFS files. Would this
> not suffice?
>
> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <ju...@gmail.com> wrote:
> > Hi hadoop users,
> >
> > I just want to verify that there is no way to put a binary on HDFS and
> > execute it using the hadoop java api.  If not, I would appreciate advice
> in
> > getting in creating an implementation that uses native libraries.
> >
> > "In contrast to the POSIX model, there are no sticky, setuid or setgid
> bits
> > for files as there is no notion of executable files."  Is there no
> > workaround?
> >
> > A little bit more about what I'm trying to do.  I have a binary that
> > converts my image to another image format.  I currently want to put it in
> > the distributed cache and tell the reducer to execute the binary on the
> data
> > on hdfs.  However, since I can't set the execute permission bit on that
> > file, it seems that I cannot do that.
> >
> > Since I cannot use the binary, it seems like I have to use my own
> > implementation to do this.  The challenge is that these libraries that I
> can
> > use to do this are .a and .so files.  Would I have to use JNI and package
> > the libraries in the distributed cache and then have the reducer find and
> > use those libraries on the task nodes?  Actually, I wouldn't want to use
> > JNI, I'd probably want to use java native access (JNA) to do this.  Has
> > anyone used JNA with hadoop and been successful?  Are there problems I'll
> > encounter?
> >
> > Please let me know.
> >
> > Thanks,
> > -Julian
>
>
>
> --
> Harsh J
>

Re: executing files on hdfs via hadoop not possible? is JNI/JNA a reasonable solution?

Posted by Harsh J <ha...@cloudera.com>.
You're confusing two things here. HDFS is a data storage filesystem.
MR does not have anything to do with HDFS (generally speaking).

A reducer runs as a regular JVM process on whatever node it is
assigned, and it can execute any program you like by first downloading
that program onto the node's local filesystem and running it from there.
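
For instance, a minimal sketch of such a reducer, assuming a binary
named "convert" has been shipped through the distributed cache and
symlinked into the task's working directory (the binary name and its
arguments are hypothetical placeholders):

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ExecReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values,
                Context context) throws IOException, InterruptedException {
            // HDFS carries no execute bit, so set it on the localized
            // copy before launching the program.
            File binary = new File("convert");
            binary.setExecutable(true);
            Process p = new ProcessBuilder("./convert",
                    "input.img", "output.img")
                    .inheritIO()
                    .start();
            int rc = p.waitFor();
            if (rc != 0) {
                throw new IOException("convert exited with status " + rc);
            }
            context.write(key, new Text("done"));
        }
    }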

If your goal is merely to run a regular program over data that is
sitting in HDFS, that can be achieved. If your library is in C, then
simply wrap it in a streaming program and use the libhdfs C/C++ API to
read data from HDFS files into your functions. Would this not suffice?

On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <ju...@gmail.com> wrote:
> Hi hadoop users,
>
> I just want to verify that there is no way to put a binary on HDFS and
> execute it using the hadoop java api.  If not, I would appreciate advice in
> getting in creating an implementation that uses native libraries.
>
> "In contrast to the POSIX model, there are no sticky, setuid or setgid bits
> for files as there is no notion of executable files."  Is there no
> workaround?
>
> A little bit more about what I'm trying to do.  I have a binary that
> converts my image to another image format.  I currently want to put it in
> the distributed cache and tell the reducer to execute the binary on the data
> on hdfs.  However, since I can't set the execute permission bit on that
> file, it seems that I cannot do that.
>
> Since I cannot use the binary, it seems like I have to use my own
> implementation to do this.  The challenge is that these libraries that I can
> use to do this are .a and .so files.  Would I have to use JNI and package
> the libraries in the distributed cache and then have the reducer find and
> use those libraries on the task nodes?  Actually, I wouldn't want to use
> JNI, I'd probably want to use java native access (JNA) to do this.  Has
> anyone used JNA with hadoop and been successful?  Are there problems I'll
> encounter?
>
> Please let me know.
>
> Thanks,
> -Julian



--
Harsh J
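
As for the JNA route raised in the original post, a minimal sketch of
a binding might look like the following -- the library name "imageconv"
and the function "convert_image" are hypothetical placeholders, and the
.so itself would be shipped via the distributed cache with the
jna.library.path system property pointed at the task's working
directory:

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    public class JnaImageConvert {
        // Binds libimageconv.so; both the library and the function
        // signature are assumptions for illustration only.
        public interface ImageConv extends Library {
            ImageConv INSTANCE =
                    (ImageConv) Native.loadLibrary("imageconv",
                            ImageConv.class);
            int convert_image(String inPath, String outPath);
        }

        public static void main(String[] args) {
            int rc = ImageConv.INSTANCE.convert_image(args[0], args[1]);
            System.out.println("convert_image returned " + rc);
        }
    }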
