Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2008/04/10 09:42:05 UTC

[jira] Commented: (HADOOP-3227) Implement a binary input/output format for Streaming

    [ https://issues.apache.org/jira/browse/HADOOP-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587512#action_12587512 ] 

Owen O'Malley commented on HADOOP-3227:
---------------------------------------

I don't see the problem with using TextInputFormat. After HADOOP-2285, TextInputFormat will move binary data straight from the file into the Text object. Streaming then needs to be changed to take the bytes from the Text and pass them straight to the application without converting them to a String. That would also speed up streaming.
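
To illustrate why skipping the String conversion matters for binary data, here is a minimal sketch. It does not use Hadoop: the Record class is a hypothetical stand-in for an object that, like Text with getBytes() and getLength(), exposes a backing buffer and the number of valid bytes in it. The class and method names are assumptions made for this example, not the actual Streaming code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RawBytesSketch {
    // Hypothetical stand-in for a record with a backing buffer that may be
    // larger than the valid data, so the length is tracked separately.
    static final class Record {
        final byte[] buffer;
        final int length;

        Record(byte[] buffer, int length) {
            this.buffer = buffer;
            this.length = length;
        }
    }

    // Bytes handed to the application when the valid region is passed as-is.
    static byte[] rawBytes(Record r) {
        return Arrays.copyOfRange(r.buffer, 0, r.length);
    }

    // Bytes handed to the application after a detour through a String.
    // Byte sequences that are not valid UTF-8 are replaced during decoding,
    // so arbitrary binary data does not survive the round trip.
    static byte[] viaString(Record r) {
        String s = new String(r.buffer, 0, r.length, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four valid bytes (two of them not legal UTF-8) plus unused slack.
        byte[] buffer = {0x00, (byte) 0xFF, (byte) 0xFE, 0x61, 0, 0, 0, 0};
        Record r = new Record(buffer, 4);
        byte[] original = Arrays.copyOfRange(buffer, 0, 4);

        System.out.println("raw path preserved data:    "
                + Arrays.equals(rawBytes(r), original));
        System.out.println("String path preserved data: "
                + Arrays.equals(viaString(r), original));
    }
}
```

The raw path also avoids the decode and re-encode work entirely, which is where the speedup would be expected to come from.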

> Implement a binary input/output format for Streaming
> ----------------------------------------------------
>
>                 Key: HADOOP-3227
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3227
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.18.0
>
>
> Lots of streaming applications process textual data with one record per line and fields separated by a delimiter. In such cases there is no point in using any of Hadoop's input/output formats, since the streaming script/binary will itself parse the input and break it into records and fields. We should instead provide users with a binary input/output format that just sends 64k (or so) blocks of data directly from HDFS to the streaming application.
> I did something very similar for Pig-Streaming (PIG-94 - BinaryStorage), which resulted in a 300%+ speedup for scanning data (identity mapper & map-only jobs). The parsing done by the input/output formats in these cases was pure overhead.
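
The block-oriented transfer proposed in the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions: it copies between plain java.io streams rather than from HDFS to a streaming process, and the class name, method names, and the exact 64 KB block size are hypothetical choices, not the proposed format's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class BlockCopySketch {
    // Assumed block size, following the "64k (or so)" in the description.
    static final int BLOCK_SIZE = 64 * 1024;

    // Copies the input to the output in fixed-size blocks without looking
    // for record or field boundaries. Returns the number of bytes copied.
    static long copyBlocks(InputStream in, OutputStream out) throws IOException {
        byte[] block = new byte[BLOCK_SIZE];
        long total = 0;
        int n;
        while ((n = in.read(block)) != -1) {
            out.write(block, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    // Convenience wrapper for in-memory data, used here for demonstration.
    static byte[] copyAll(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyBlocks(new ByteArrayInputStream(data), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Sample input larger than one block, with newline-delimited content
        // that the copy loop never inspects.
        byte[] data = new byte[200_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i % 64 == 63 ? '\n' : 'a');
        }
        byte[] copied = copyAll(data);
        System.out.println("bytes copied: " + copied.length);
    }
}
```

Because the loop never parses its input, splitting it into records and fields is left entirely to the consumer, which matches the case described above where the streaming script does its own parsing.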
