Posted to common-dev@hadoop.apache.org by "Klaas Bosteels (JIRA)" <ji...@apache.org> on 2009/01/26 18:25:59 UTC

[jira] Updated: (HADOOP-1722) Make streaming to handle non-utf8 byte array

     [ https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-1722:
-----------------------------------

    Attachment: HADOOP-1722.patch

Are there any comments on the attached patch? It basically implements an extended version of Eric's idea of adding an option that switches Streaming to a new binary format. Instead of a 4-byte length, however, it uses a 1-byte type code (from which the number of following bytes is derived). This leads to a slightly more compact representation for basic types (e.g. a float requires 1 + 4 bytes instead of 4 + 4 bytes), and it also solves another important Streaming issue, namely, that all type information is lost when everything is converted to strings.
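To make the size comparison concrete, here is a minimal Python sketch of the type-code idea. The code value used below is a hypothetical choice for illustration only; the authoritative definitions live in the patch itself:
{code}
import struct

TYPE_FLOAT = 5  # hypothetical type code for a 32-bit float

def write_float(out, value):
    # 1 type byte + 4 payload bytes, versus 4 length bytes + 4 payload
    # bytes in a length-prefixed format.
    out.write(struct.pack('>Bf', TYPE_FLOAT, value))

def read_float(inp):
    (code,) = struct.unpack('>B', inp.read(1))
    assert code == TYPE_FLOAT
    (value,) = struct.unpack('>f', inp.read(4))
    return value
{code}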


h5. Contents

The patch consists of the following parts:
* A new package {{org.apache.hadoop.typedbytes}} in {{src/core}} that provides functionality for dealing with sequences of bytes in which the first byte is a type code. This package also includes classes that can convert Writables to/from typed bytes and (de)serialize Records to/from typed bytes. The typed bytes format itself was kept as simple and straightforward as possible in order to make it very easy to write conversion code in other languages (see the sketch after this list).
* Changes to Streaming that add the {{-typedbytes none|input|output|all}} option. When typed bytes are requested for the input, the functionality provided by the package {{org.apache.hadoop.typedbytes}} is used to convert all input Writables to typed bytes (which makes it possible to let Streaming programs seamlessly take sequence files containing Records and/or other Writables as input), and when typed bytes are used for the output, Streaming outputs {{TypedBytesWritables}} (i.e. instances of the {{org.apache.hadoop.typedbytes.TypedBytesWritable}} class, which extends {{BytesWritable}}).
* A new tool {{DumpTypedBytes}} in {{src/tools}} that dumps DFS files as typed bytes to stdout. This can often be a lot more convenient than printing out the strings returned by the {{toString()}} methods, and it can also be used to fetch an input sample from the DFS for testing Streaming programs that use typed bytes.
* A new input format called {{AutoInputFormat}}, which can take text files as well as sequence files (or both at the same time) as input. The functionality to deal with text and sequence files transparently was required for the {{DumpTypedBytes}} tool, and putting it in an input format makes sense since the ability to take both text and sequence files as input can be very useful for Streaming programs. Because Streaming still uses the old mapred API, the patch includes two versions of {{AutoInputFormat}} (one for the old and another for the new API).
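To illustrate how the number of following bytes can be derived from the type code (see the first bullet above), here is a rough Python sketch. The code-to-size mapping below is an assumption for illustration; the {{org.apache.hadoop.typedbytes}} package is the authoritative definition:
{code}
import struct

# Hypothetical mapping from type code to (struct format, payload size).
FIXED_TYPES = {
    2: ('>?', 1),  # boolean
    3: ('>i', 4),  # int
    4: ('>q', 8),  # long
    5: ('>f', 4),  # float
    6: ('>d', 8),  # double
}

def read_typed_value(inp):
    (code,) = struct.unpack('>B', inp.read(1))
    fmt, size = FIXED_TYPES[code]  # payload length follows from the code
    (value,) = struct.unpack(fmt, inp.read(size))
    return value
{code}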


h5. Example 

Using the simple Python module available at http://github.com/klbostee/typedbytes, the mapper script
{code}
import sys
import typedbytes

# Read (key, value) pairs as typed bytes from stdin and emit
# (word, 1) pairs as typed bytes to stdout.
input = typedbytes.PairedInput(sys.stdin)
output = typedbytes.PairedOutput(sys.stdout)

for (key, value) in input:
    for word in value.split():
        output.write((word, 1))
{code}
and the reducer script
{code}
import sys
import typedbytes
from itertools import groupby
from operator import itemgetter

input = typedbytes.PairedInput(sys.stdin)
output = typedbytes.PairedOutput(sys.stdout)

# The framework sorts the pairs by key, so consecutive pairs with the
# same key can be grouped and their counts summed.
for (key, group) in groupby(input, itemgetter(0)):
    values = map(itemgetter(1), group)
    output.write((key, sum(values)))
{code}
can be used to do a simple wordcount. The unit tests include a similar example in Java.
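Assuming the scripts above are saved as {{mapper.py}} and {{reducer.py}}, the job could be launched along the following lines (the jar name and the input/output paths are placeholders; {{-typedbytes all}} is the option added by this patch):
{code}
hadoop jar hadoop-streaming.jar \
    -input wcinput -output wcoutput \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py \
    -typedbytes all
{code}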


h5. Remark

This patch renders HADOOP-4304 mostly obsolete, since it provides all the underlying functionality required for Dumbo. If this patch gets accepted, then future versions of Dumbo will probably consist of Python code only again and thus be very easy to install and use, which makes adding Dumbo to contrib less of a requirement.

> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Christopher Zimmerman
>         Attachments: HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the outputs of the stream process (mapper or reducer) to be
> line-oriented UTF-8 text. This limit makes it impossible to use programs whose outputs may be non-UTF-8
> (international encodings, or maybe even binary data). Streaming can overcome this limit by introducing a
> simple encoding protocol. For example, it can allow the mapper/reducer to hex-encode its keys/values, and
> the framework can decode them on the Java side. This way, as long as the mapper/reducer executables follow
> this encoding protocol, they can output arbitrary byte arrays and the streaming framework can handle them.
