Posted to common-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2011/03/24 06:26:50 UTC
Direct HDFS access from a streaming job
This webpage:
http://hadoop.apache.org/common/docs/r0.18.3/streaming.html
contains this passage:
[BEGIN]
How do I process files, one per map?
As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods:
• Hadoop Streaming and custom mapper script:
• Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
• Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory
[END]
I'm not trying to gzip files as in the example, but I would like to read files directly from HDFS into C++ streaming code, as opposed to passing those files as input through the streaming input interface (stdin).
I'm not sure how to reference HDFS from C++ though. I mean, how would one open an ifstream to such a file?
________________________________________________________________________________
Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com
"Luminous beings are we, not this crude matter."
-- Yoda
________________________________________________________________________________
Re: Direct HDFS access from a streaming job
Posted by Keith Wiley <kw...@keithwiley.com>.
On Mar 24, 2011, at 8:31 AM, Harsh J wrote:
> Hello,
>
> On Thu, Mar 24, 2011 at 8:45 PM, Keith Wiley <kw...@keithwiley.com> wrote:
>> Thanks. Actually, I think that with reference to the passage I quoted in my first post, the unstated intent was to simply do a system() call and invoke "hadoop fs -get" or "hadoop fs -copyToLocal".
>
> Some would consider that a bit ugly in C/C++, but it's almost the same
> thing, I guess. For shell programs passed to streaming, though, it's a
> good solution.
I agree. I like the API you found. I was just saying that, after posting my question, all the references I found online that seemed to address that passage in the docs described the solution I then posted: directly invoking "hadoop fs -get".
________________________________________________________________________________
Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com
"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
-- Galileo Galilei
________________________________________________________________________________
Re: Direct HDFS access from a streaming job
Posted by Harsh J <qw...@gmail.com>.
Hello,
On Thu, Mar 24, 2011 at 8:45 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> Thanks. Actually, I think that with reference to the passage I quoted in my first post, the unstated intent was to simply do a system() call and invoke "hadoop fs -get" or "hadoop fs -copyToLocal".
Some would consider that a bit ugly in C/C++, but it's almost the same
thing, I guess. For shell programs passed to streaming, though, it's a
good solution.
--
Harsh J
http://harshj.com
Re: Direct HDFS access from a streaming job
Posted by Keith Wiley <kw...@keithwiley.com>.
On Mar 23, 2011, at 11:10 PM, Harsh J wrote:
> There is a C-HDFS API + library (called libhdfs) available @
> http://hadoop.apache.org/common/docs/r0.20.2/libhdfs.html. Perhaps you
> can make your C++ mapper program use that?
Thanks. Actually, I think that with reference to the passage I quoted in my first post, the unstated intent was to simply do a system() call and invoke "hadoop fs -get" or "hadoop fs -copyToLocal".
________________________________________________________________________________
Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com
"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
-- Keith Wiley
________________________________________________________________________________
Re: Direct HDFS access from a streaming job
Posted by Harsh J <qw...@gmail.com>.
There is a C-HDFS API + library (called libhdfs) available @
http://hadoop.apache.org/common/docs/r0.20.2/libhdfs.html. Perhaps you
can make your C++ mapper program use that?
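[Editor's note: for reference, a minimal read loop against the libhdfs API might look like the sketch below. It is untested here: it assumes the libhdfs header and library shipped with Hadoop are available at build time, and that "default" resolves to the cluster's NameNode via the Hadoop configuration.]

```c
#include <fcntl.h>   /* O_RDONLY */
#include <stdio.h>
#include "hdfs.h"    /* libhdfs header, shipped with Hadoop */

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <hdfs-path>\n", argv[0]);
        return 1;
    }

    /* "default" picks the filesystem up from the Hadoop configuration. */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

    hdfsFile file = hdfsOpenFile(fs, argv[1], O_RDONLY, 0, 0, 0);
    if (!file) {
        fprintf(stderr, "hdfsOpenFile failed for %s\n", argv[1]);
        hdfsDisconnect(fs);
        return 1;
    }

    char buf[4096];
    tSize n;
    while ((n = hdfsRead(fs, file, buf, sizeof(buf))) > 0) {
        fwrite(buf, 1, (size_t)n, stdout);  /* or hand the bytes to mapper logic */
    }

    hdfsCloseFile(fs, file);
    hdfsDisconnect(fs);
    return 0;
}
```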
On Thu, Mar 24, 2011 at 10:56 AM, Keith Wiley <kw...@keithwiley.com> wrote:
> This webpage:
>
> http://hadoop.apache.org/common/docs/r0.18.3/streaming.html
>
> contains this passage:
>
> [BEGIN]
> How do I process files, one per map?
>
> As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods:
>
> • Hadoop Streaming and custom mapper script:
> • Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
> • Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory
> [END]
>
> I'm not trying to gzip files as in the example, but I would like to read files directly from HDFS into C++ streaming code, as opposed to passing those files as input through the streaming input interface (stdin).
>
> I'm not sure how to reference HDFS from C++ though. I mean, how would one open an ifstream to such a file?
>
> ________________________________________________________________________________
> Keith Wiley kwiley@keithwiley.com keithwiley.com music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
> -- Yoda
> ________________________________________________________________________________
>
>
--
Harsh J
http://harshj.com