Posted to common-user@hadoop.apache.org by GorGo <gy...@ru.is> on 2012/01/10 17:31:00 UTC

Hadoop PIPES job using C++ and binary data results in data locality problem.

Hi everyone.

I am running C++ code through the PIPES wrapper and I am looking for
tutorials, examples or any other kind of help with handling binary data.
My problem is that I am working with large chunks of binary data, and
converting them to text is not an option.
My first question is thus: can I pass large chunks (>128 MB) of binary data
through the PIPES interface?
I have not been able to find any documentation on this.
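
For context, my mapper follows the standard Pipes pattern, roughly like the
simplified sketch below (the class names are placeholders). As far as I can
tell, keys and values cross the pipe as std::string, which is where my
uncertainty about binary data comes from:

    #include "hadoop/Pipes.hh"
    #include "hadoop/TemplateFactory.hh"
    #include "hadoop/StringUtils.hh"

    // Simplified sketch of the kind of Pipes mapper I am running. The
    // record value arrives as a std::string, so large binary chunks would
    // have to travel through that interface somehow.
    class BinaryChunkMapper : public HadoopPipes::Mapper {
    public:
      BinaryChunkMapper(HadoopPipes::TaskContext& context) {}

      void map(HadoopPipes::MapContext& context) {
        const std::string& value = context.getInputValue();
        // ... process the chunk ...
        context.emit(context.getInputKey(),
                     HadoopUtils::toString((int) value.size()));
      }
    };

    // Trivial pass-through reducer, just so the factory has both types.
    class PassThroughReducer : public HadoopPipes::Reducer {
    public:
      PassThroughReducer(HadoopPipes::TaskContext& context) {}

      void reduce(HadoopPipes::ReduceContext& context) {
        while (context.nextValue()) {
          context.emit(context.getInputKey(), context.getInputValue());
        }
      }
    };

    int main(int argc, char* argv[]) {
      return HadoopPipes::runTask(
          HadoopPipes::TemplateFactory<BinaryChunkMapper,
                                       PassThroughReducer>());
    }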

The way I do things now is that I bypass the Hadoop input path by opening
and reading the data directly from the C++ code using the HDFS C API.
However, that means I lose data locality, which causes too much network
overhead to be viable at large scale.
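
Concretely, the workaround looks roughly like the sketch below (the path,
offset and length are placeholders for whatever a task is told to process):

    #include <fcntl.h>
    #include <vector>
    #include "hdfs.h"  // libhdfs, the HDFS C API

    // Rough sketch of how I currently bypass the Java input path and read
    // a chunk straight from HDFS inside the C++ code.
    bool readChunk(const char* path, tOffset offset, tSize length,
                   std::vector<char>& buffer) {
      hdfsFS fs = hdfsConnect("default", 0);   // default configured namenode
      if (fs == NULL) return false;

      hdfsFile file = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
      if (file == NULL) { hdfsDisconnect(fs); return false; }

      buffer.resize(length);
      // hdfsPread reads from an absolute offset; the block may well live
      // on another node, which is exactly the locality problem I describe.
      tSize read = hdfsPread(fs, file, offset, &buffer[0], length);

      hdfsCloseFile(fs, file);
      hdfsDisconnect(fs);
      return read == length;
    }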

If passing binary data directly is not possible with PIPES, I need to
somehow write my own RecordReader that maintains data locality but does not
actually emit the data (I just need to make sure the C++ mapper reads the
same data from a local source when it is spawned).
The RecordReader does not actually need to read the data at all. Generating
a config string that tells the C++ mapper code what to read would be just
fine.
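
What I am picturing is something along the lines of the sketch below (the
descriptor idea and its format are my own invention, and I have not verified
that this actually preserves locality):

    #include "hadoop/Pipes.hh"

    // Sketch of the "non-reading" reader I have in mind: instead of
    // streaming the bytes through the pipe, it emits a single record whose
    // value just describes what the C++ mapper should open locally
    // (e.g. "path:offset:length"). The job would set
    // hadoop.pipes.java.recordreader to false so this reader is used
    // instead of a Java one.
    class DescriptorReader : public HadoopPipes::RecordReader {
    private:
      std::string descriptor;
      bool done;
    public:
      DescriptorReader(HadoopPipes::MapContext& context)
          : descriptor(context.getInputSplit()), done(false) {
        // getInputSplit() hands back the serialized split; a real reader
        // would decode it here into the "path:offset:length" descriptor.
      }

      bool next(std::string& key, std::string& value) {
        if (done) return false;
        key = "chunk";
        value = descriptor;  // the mapper parses this and reads locally
        done = true;
        return true;
      }

      float getProgress() { return done ? 1.0f : 0.0f; }
    };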

My second question is thus: how do I write my own RecordReader, in C++ or
Java?
I would also like information on how Hadoop maintains data locality
between RecordReaders and the spawned map tasks.

Any information is most welcome. 

Regards 
   GorGo
-- 
View this message in context: http://old.nabble.com/Hadoop-PIPES-job-using-C%2B%2B-and-binary-data-results-in-data-locality-problem.-tp33112818p33112818.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Hadoop PIPES job using C++ and binary data results in data locality problem.

Posted by Robert Evans <ev...@yahoo-inc.com>.
I think what you want to do is use JNI rather than Pipes or streaming.  PIPES has known issues and it is my understanding that its use is now discouraged.  The ideal way to do this is to use JNI to hand your data to the C++ code.  Be aware that moving large amounts of data through JNI has challenges of its own, but most of these can be solved by using a direct ByteBuffer.
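
Roughly, the native side can then work on the buffer's memory in place, along the lines of the sketch below (the class and method names are placeholders, and the Java side would allocate the buffer with ByteBuffer.allocateDirect):

    #include <jni.h>

    // Sketch of the native half of the JNI route: the Java side passes a
    // direct ByteBuffer, and the C++ side gets at the bytes without
    // copying them through a jbyteArray.
    extern "C" JNIEXPORT jint JNICALL
    Java_com_example_NativeProcessor_processChunk(JNIEnv* env, jobject self,
                                                  jobject directBuffer) {
      void* data = env->GetDirectBufferAddress(directBuffer);
      jlong capacity = env->GetDirectBufferCapacity(directBuffer);
      if (data == NULL || capacity < 0) {
        return -1;  // not a direct buffer, or direct access unsupported
      }

      // ... run the existing C++ code over the capacity bytes at data ...
      return 0;
    }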

--Bobby Evans
