Posted to user@hadoop.apache.org by Youssef Hatem <yo...@rwth-aachen.de> on 2013/10/09 13:13:53 UTC

Problem with streaming exact binary chunks

Hello,

I wrote a very simple InputFormat and RecordReader to send binary data to the mappers. The binary data can contain any bytes (including \n, \t, and \r); here is what next() might actually send:

public class MyRecordReader implements
        RecordReader<BytesWritable, BytesWritable> {
    ...
    public boolean next(BytesWritable key, BytesWritable ignore)
            throws IOException {
        ...

        // Fill with 01..08, then overwrite indices 3 and 4 with '\n' (0x0a),
        // so the record is 01 02 03 0a 0a 06 07 08.
        byte[] result = new byte[8];
        for (int i = 0; i < result.length; ++i)
            result[i] = (byte) (i + 1);
        result[3] = (byte) '\n';
        result[4] = (byte) '\n';

        key.set(result, 0, result.length);
        return true;
    }
}

As you can see, I am using BytesWritable to send eight bytes: 01 02 03 0a 0a 06 07 08. I also use HADOOP-1722 typed bytes (by setting -D stream.map.input=typedbytes).
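For reference, running the byte-filling logic from next() in isolation confirms the record contents (this is just the snippet above extracted into a standalone class, with a small hex helper added for printing):

```java
public class RecordBytes {
    // Format a byte array as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
            sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }

    // Same logic as in MyRecordReader.next(): fill with 01..08,
    // then overwrite indices 3 and 4 with '\n' (0x0a).
    static byte[] record() {
        byte[] result = new byte[8];
        for (int i = 0; i < result.length; ++i)
            result[i] = (byte) (i + 1);
        result[3] = (byte) '\n';
        result[4] = (byte) '\n';
        return result;
    }

    public static void main(String[] args) {
        System.out.println(hex(record()));  // 01 02 03 0a 0a 06 07 08
    }
}
```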

According to the documentation of typed bytes, the mapper should receive the following byte sequence:
00 00 00 08 01 02 03 0a 0a 06 07 08

However, the bytes are somehow modified, and I get the following sequence instead:
00 00 00 08 01 02 03 09 0a 09 0a 06 07 08

(0a = '\n', 09 = '\t')

It seems that Hadoop streaming parsed each newline byte as a record separator and inserted a '\t' before it, which I assume is the streaming key/value separator.
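That diagnosis fits the two sequences exactly: inserting a 09 before every 0a in the expected frame reproduces the observed frame byte for byte. A minimal standalone check (plain Java, only mimicking the transformation I believe streaming applied; it is not streaming's actual code path):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class SeparatorCheck {
    // Insert a tab (0x09) before every newline (0x0a), mimicking the
    // corruption observed on the streaming pipe.
    static byte[] tabBeforeNewlines(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : in) {
            if (b == 0x0a)
                out.write(0x09);
            out.write(b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] expected = {0x00, 0x00, 0x00, 0x08, 0x01, 0x02, 0x03,
                           0x0a, 0x0a, 0x06, 0x07, 0x08};
        byte[] observed = {0x00, 0x00, 0x00, 0x08, 0x01, 0x02, 0x03,
                           0x09, 0x0a, 0x09, 0x0a, 0x06, 0x07, 0x08};
        // Applying the transformation to the expected frame yields the
        // observed frame.
        System.out.println(Arrays.equals(tabBeforeNewlines(expected), observed));  // true
    }
}
```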

Is there any workaround to send *exactly* the same byte sequence, no matter what characters are in it? Thanks in advance.

Best regards,
Youssef Hatem

Re: Problem with streaming exact binary chunks

Posted by Youssef Hatem <yo...@rwth-aachen.de>.
Hello,

Thanks a lot for the information. It helped me figure out a solution to this problem.

I posted a sketch of the solution on StackOverflow (http://stackoverflow.com/a/19295610/337194) for anybody who is interested.

Best regards,
Youssef Hatem

On Oct 9, 2013, at 14:08, Peter Marron wrote:

> Hi,
> 
> The only way that I could find was to override the various InputWriter and OutputReader classes,
> as defined by the configuration settings
> stream.map.input.writer.class
> stream.map.output.reader.class
> stream.reduce.input.writer.class
> stream.reduce.output.reader.class
> which was painful. Hopefully someone will tell you the _correct_ way to do this.
> If not I will provide more details.
> 
> Regards,
> 
> Peter Marron
> Trillium Software UK Limited
> 
> Tel : +44 (0) 118 940 7609
> Fax : +44 (0) 118 940 7699
> E: Peter.Marron@TrilliumSoftware.com
> 
> -----Original Message-----
> From: Youssef Hatem [mailto:youssef.hatem@rwth-aachen.de] 
> Sent: 09 October 2013 12:14
> To: user@hadoop.apache.org
> Subject: Problem with streaming exact binary chunks
> 
> Hello,
> 
> I wrote a very simple InputFormat and RecordReader to send binary data to the mappers. The binary data can contain any bytes (including \n, \t, and \r); here is what next() might actually send:
> 
> public class MyRecordReader implements
>        RecordReader<BytesWritable, BytesWritable> {
>    ...
>    public boolean next(BytesWritable key, BytesWritable ignore)
>            throws IOException {
>        ...
> 
>        byte[] result = new byte[8];
>        for (int i = 0; i < result.length; ++i)
>            result[i] = (byte)(i+1);
>        result[3] = (byte)'\n';
>        result[4] = (byte)'\n';
> 
>        key.set(result, 0, result.length);
>        return true;
>    }
> }
> 
> As you can see, I am using BytesWritable to send eight bytes: 01 02 03 0a 0a 06 07 08. I also use HADOOP-1722 typed bytes (by setting -D stream.map.input=typedbytes).
> 
> According to the documentation of typed bytes the mapper should receive the following byte sequence: 
> 00 00 00 08 01 02 03 0a 0a 06 07 08
> 
> However bytes are somehow modified and I get the following sequence instead:
> 00 00 00 08 01 02 03 09 0a 09 0a 06 07 08
> 
> 0a = '\n'
> 09 = '\t'
> 
> It seems that Hadoop streaming parsed each newline byte as a record separator and inserted a '\t' before it, which I assume is the streaming key/value separator.
> 
> Is there any workaround to send *exactly* the same byte sequence, no matter what characters are in it? Thanks in advance.
> 
> Best regards,
> Youssef Hatem



RE: Problem with streaming exact binary chunks

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Hi,

The only way that I could find was to override the various InputWriter and OutputReader classes,
as defined by the configuration settings
stream.map.input.writer.class
stream.map.output.reader.class
stream.reduce.input.writer.class
stream.reduce.output.reader.class
which was painful. Hopefully someone will tell you the _correct_ way to do this.
If not I will provide more details.
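As a rough sketch of what such an override might emit: the class below is plain Java with no Hadoop dependency, and only shows the raw length-prefixed framing that a custom writer, plugged in via stream.map.input.writer.class, could produce for a BytesWritable key. The Hadoop-side wiring (extending org.apache.hadoop.streaming.io.InputWriter and overriding its writeKey/writeValue methods to call something like this) is an assumption here, not shown or verified:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawFrameWriter {
    // Emit a 4-byte big-endian length prefix followed by the payload
    // verbatim, with no separator handling at all. Any 0x09/0x0a bytes
    // in the payload pass through untouched because nothing here ever
    // inspects the payload.
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(payload.length);  // DataOutputStream.writeInt is big-endian
        out.write(payload);
        out.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {0x01, 0x02, 0x03, 0x0a, 0x0a, 0x06, 0x07, 0x08};
        byte[] framed = frame(payload);
        // 12 bytes total: 00 00 00 08 + the 8 payload bytes unchanged.
        System.out.println(framed.length);  // 12
    }
}
```

A reader on the mapper side would then read the 4-byte length first and consume exactly that many payload bytes, so embedded newlines never matter.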

Regards,

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: Peter.Marron@TrilliumSoftware.com

-----Original Message-----
From: Youssef Hatem [mailto:youssef.hatem@rwth-aachen.de] 
Sent: 09 October 2013 12:14
To: user@hadoop.apache.org
Subject: Problem with streaming exact binary chunks

Hello,

I wrote a very simple InputFormat and RecordReader to send binary data to the mappers. The binary data can contain any bytes (including \n, \t, and \r); here is what next() might actually send:

public class MyRecordReader implements
        RecordReader<BytesWritable, BytesWritable> {
    ...
    public boolean next(BytesWritable key, BytesWritable ignore)
            throws IOException {
        ...

        byte[] result = new byte[8];
        for (int i = 0; i < result.length; ++i)
            result[i] = (byte)(i+1);
        result[3] = (byte)'\n';
        result[4] = (byte)'\n';

        key.set(result, 0, result.length);
        return true;
    }
}

As you can see, I am using BytesWritable to send eight bytes: 01 02 03 0a 0a 06 07 08. I also use HADOOP-1722 typed bytes (by setting -D stream.map.input=typedbytes).

According to the documentation of typed bytes the mapper should receive the following byte sequence: 
00 00 00 08 01 02 03 0a 0a 06 07 08

However bytes are somehow modified and I get the following sequence instead:
00 00 00 08 01 02 03 09 0a 09 0a 06 07 08

0a = '\n'
09 = '\t'

It seems that Hadoop streaming parsed each newline byte as a record separator and inserted a '\t' before it, which I assume is the streaming key/value separator.

Is there any workaround to send *exactly* the same byte sequence, no matter what characters are in it? Thanks in advance.

Best regards,
Youssef Hatem
