Posted to common-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2011/03/24 06:26:50 UTC

Direct HDFS access from a streaming job

This webpage:

http://hadoop.apache.org/common/docs/r0.18.3/streaming.html

contains this passage:

[BEGIN]
How do I process files, one per map?

As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods:

	• Hadoop Streaming and custom mapper script:
		• Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
		• Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory 
[END]

I'm not trying to gzip files as in the example, but I would like to read files directly from HDFS into C++ streaming code, as opposed to passing those files as input through the streaming input interface (stdin).

I'm not sure how to reference HDFS from C++ though.  I mean, how would one open an ifstream to such a file?

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________


Re: Direct HDFS access from a streaming job

Posted by Keith Wiley <kw...@keithwiley.com>.
On Mar 24, 2011, at 8:31 AM, Harsh J wrote:

> Hello,
> 
> On Thu, Mar 24, 2011 at 8:45 PM, Keith Wiley <kw...@keithwiley.com> wrote:
>> Thanks.  Actually, I think that with reference to the passage I quoted in my first post, the unstated intent was to simply do a system() call and invoke "hadoop fs -get" or "hadoop fs -copyToLocal".
> 
> Some would consider that a bit ugly in C/C++, but it's almost the same
> thing, I guess. For shell programs passed to streaming, though, it's a
> good solution.


I agree.  I like the API you found.  I was just saying that, after posting my question, every reference I found online that addressed that passage in the docs described the solution I then posted: directly invoking "hadoop fs -get".

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________


Re: Direct HDFS access from a streaming job

Posted by Harsh J <qw...@gmail.com>.
Hello,

On Thu, Mar 24, 2011 at 8:45 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> Thanks.  Actually, I think that with reference to the passage I quoted in my first post, the unstated intent was to simply do a system() call and invoke "hadoop fs -get" or "hadoop fs -copyToLocal".

Some would consider that a bit ugly in C/C++, but it's almost the same
thing, I guess. For shell programs passed to streaming, though, it's a
good solution.

-- 
Harsh J
http://harshj.com

Re: Direct HDFS access from a streaming job

Posted by Keith Wiley <kw...@keithwiley.com>.
On Mar 23, 2011, at 11:10 PM, Harsh J wrote:

> There is a C-HDFS API + library (called libhdfs) available @
> http://hadoop.apache.org/common/docs/r0.20.2/libhdfs.html. Perhaps you
> can make your C++ mapper program use that?


Thanks.  Actually, I think that with reference to the passage I quoted in my first post, the unstated intent was to simply do a system() call and invoke "hadoop fs -get" or "hadoop fs -copyToLocal".

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________


Re: Direct HDFS access from a streaming job

Posted by Harsh J <qw...@gmail.com>.
There is a C-HDFS API + library (called libhdfs) available @
http://hadoop.apache.org/common/docs/r0.20.2/libhdfs.html. Perhaps you
can make your C++ mapper program use that?
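For reference, reading a file straight from HDFS through libhdfs looks roughly like this. It is a sketch, not a drop-in program: it needs the libhdfs header and library, a configured CLASSPATH, and a reachable NameNode, and both "default" and the input path are placeholders.

```cpp
// Sketch: stream an HDFS file's bytes to stdout via libhdfs (hdfs.h).
// Requires linking against libhdfs and a running Hadoop cluster.
#include <cstdio>
#include <fcntl.h>
#include "hdfs.h"

int main() {
    // "default" picks up fs.default.name from the Hadoop configuration.
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) { std::fprintf(stderr, "connect failed\n"); return 1; }

    // Placeholder path.
    hdfsFile in = hdfsOpenFile(fs, "/user/keith/input/part-00000",
                               O_RDONLY, 0, 0, 0);
    if (!in) {
        std::fprintf(stderr, "open failed\n");
        hdfsDisconnect(fs);
        return 1;
    }

    char buf[4096];
    tSize n;
    while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0) {
        std::fwrite(buf, 1, static_cast<size_t>(n), stdout);
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return 0;
}
```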

On Thu, Mar 24, 2011 at 10:56 AM, Keith Wiley <kw...@keithwiley.com> wrote:
> This webpage:
>
> http://hadoop.apache.org/common/docs/r0.18.3/streaming.html
>
> contains this passage:
>
> [BEGIN]
> How do I process files, one per map?
>
> As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods:
>
>        • Hadoop Streaming and custom mapper script:
>                • Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
>                • Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory
> [END]
>
> I'm not trying to gzip files as in the example, but I would like to read files directly from HDFS into C++ streaming code, as opposed to passing those files as input through the streaming input interface (stdin).
>
> I'm not sure how to reference HDFS from C++ though.  I mean, how would one open an ifstream to such a file?
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>                                           --  Yoda
> ________________________________________________________________________________
>
>



-- 
Harsh J
http://harshj.com