Posted to hdfs-user@hadoop.apache.org by Dmitry Sivachenko <tr...@gmail.com> on 2014/09/10 17:51:40 UTC

Writing output from streaming task without dealing with key/value

Hello!

Imagine the following common task: I want to process a big text file line by line using the streaming interface,
for instance by running the unix grep command, or by doing some other line-by-line processing such as line.upper().
I copy the file to HDFS.

Then I run a map task on this file which reads one line, modifies it in some way and then writes it to the output.

TextInputFormat suits reading well: its key is the offset in bytes (meaningless in my case) and the value is the line itself, so I can iterate over the lines like this (in Python):
import sys
for line in sys.stdin:
    sys.stdout.write(line.upper())  # the line already ends with a newline

The problem arises with TextOutputFormat: it tries to split the resulting line on mapreduce.output.textoutputformat.separator, which results in an extra separator in the output if this character is missing from the line (an extra TAB at the end, if we stick to the defaults).
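
To make the symptom concrete, here is a small Python model of it. It is only a sketch, not Hadoop code: split_key_value and text_output_format are hypothetical names, and it assumes that the line is split at the first TAB and that key and value are later joined again with the separator, even when the value is empty.

```python
SEPARATOR = "\t"  # default separator, as described in this post


def split_key_value(line, sep=SEPARATOR):
    """Assumed split step: text before the first separator is the key."""
    key, _, value = line.partition(sep)
    return key, value  # value is "" when the separator is absent


def text_output_format(key, value, sep=SEPARATOR):
    """Assumed write step: key, separator, value (the value is empty, not null)."""
    return key + sep + value


def round_trip(line):
    """What a line printed by the script looks like after both steps."""
    return text_output_format(*split_key_value(line))


if __name__ == "__main__":
    for line in ["HELLO WORLD", "HELLO\tWORLD"]:
        print(repr(line), "->", repr(round_trip(line)))
```

In this model a line without a TAB gains a trailing TAB, while a line that already contains one passes through unchanged.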

Is there any way to write the result of a streaming task without any internal processing, so that it appears exactly as the script produces it?

If this is impossible with Hadoop, which works with key/value pairs, maybe there are other frameworks on top of HDFS that allow this?

Thanks in advance!

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.


> On 10 Sep 2014, at 22:47, Shahab Yunus <sh...@gmail.com> wrote:
> 
> Examples (the top ones are related to streaming jobs):
> 
> http://www.infoq.com/articles/HadoopOutputFormat
> http://research.neustar.biz/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/
> http://stackoverflow.com/questions/12759651/how-to-override-inputformat-and-outputformat-in-hadoop-application
> 


Thanks for the links.  The problem is that in RecordWriter.write() I get two parameters: key and value. If one of them is empty, I have no way to tell whether I should output the delimiter (because it was present in the original line) or not.

What is the proper way to work around that issue?
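
The ambiguity can be shown with a few lines of Python. This is a sketch under an assumption, not Hadoop code: the hypothetical split_key_value below splits at the first TAB, which is how the key and value are assumed to be derived from each output line.

```python
def split_key_value(line, sep="\t"):
    """Hypothetical model: split a script output line at the first separator."""
    key, _, value = line.partition(sep)
    return key, value


if __name__ == "__main__":
    without_tab = split_key_value("a")
    with_trailing_tab = split_key_value("a\t")
    # Both lines yield the same pair, so a writer that only sees
    # (key, value) cannot tell which of the two the script printed.
    print(without_tab, with_trailing_tab, without_tab == with_trailing_tab)
```

Any fix therefore has to happen before the split, or avoid the split altogether; the writer alone does not have enough information.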


> Regards,
> Shahab
> 
>> On Wed, Sep 10, 2014 at 2:28 PM, Dmitry Sivachenko <tr...@gmail.com> wrote:
>> 
>> On 10 Sep 2014, at 22:19, Rich Haase <rd...@gmail.com> wrote:
>> 
>> > You can write a custom output format
>> 
>> 
>> Any clues how this can be done?
>> 
>> 
>> 
>> > , or you can write your mapreduce job in Java and use a NullWritable as Susheel recommended.
>> >
>> > grep (and every other *nix text processing command I can think of) would not be bothered by a trailing tab character.  It's even quite easy to strip away that tab character, if you don't want it, during the post-processing steps you want to perform with *nix commands.
>> 
>> 
>> The problem is that if the line itself contains a TAB in the middle, there will be no extra trailing TAB at the end.
>> So it is not that simple.
>> You never know whether a TAB comes from the original line or is an extra TAB added by TextOutputFormat.
>> 
>> Thanks!
>

Re: Writing output from streaming task without dealing with key/value

Posted by Shahab Yunus <sh...@gmail.com>.

Examples (the top ones are related to streaming jobs):

http://www.infoq.com/articles/HadoopOutputFormat
http://research.neustar.biz/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/
http://stackoverflow.com/questions/12759651/how-to-override-inputformat-and-outputformat-in-hadoop-application

Regards,
Shahab

On Wed, Sep 10, 2014 at 2:28 PM, Dmitry Sivachenko <tr...@gmail.com>
wrote:

>
> On 10 Sep 2014, at 22:19, Rich Haase <rd...@gmail.com> wrote:
>
> > You can write a custom output format
>
>
> Any clues how this can be done?
>
>
>
> > , or you can write your mapreduce job in Java and use a NullWritable as
> Susheel recommended.
> >
> > grep (and every other *nix text processing command I can think of) would
> not be bothered by a trailing tab character.  It's even quite easy to strip
> away that tab character, if you don't want it, during the post-processing
> steps you want to perform with *nix commands.
>
>
> The problem is that if the line itself contains a TAB in the middle, there
> will be no extra trailing TAB at the end.
> So it is not that simple.
> You never know whether a TAB comes from the original line or is an extra
> TAB added by TextOutputFormat.
>
> Thanks!

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

Okay, FWIW I found the solution:

https://issues.apache.org/jira/browse/MAPREDUCE-6085

Thanks to all who replied.


On 11 Sep 2014, at 11:16, Dmitry Sivachenko <tr...@gmail.com> wrote:

> After a streaming job outputs some data to stdout, some Hadoop code receives it and splits it into a key/value pair before it reaches TextOutputFormat.
> Can anyone point me to that piece of code, please?
> 
> Thanks!
> 
> On 11 Sep 2014, at 0:37, Dmitry Sivachenko <tr...@gmail.com> wrote:
> 
>> 
>> On 10 Sep 2014, at 22:33, Felix Chern <id...@gmail.com> wrote:
>> 
>>> Use ‘tr -s’ to strip out tabs?
>>> 
>>> $ echo -e "a\t\t\tb"
>>> a			b
>>> 
>>> $ echo -e "a\t\t\tb" | tr -s "\t"
>>> a	b
>>> 
>> 
>> There can be tabs in the input, and I want to keep the input lines without any modification.
>> 
>> Actually, it is a rather standard task: process lines one by one without inserting extra characters.  There should be a standard solution for it, IMO.
>> 
>

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

Okay, FWIW I found the solution:

https://issues.apache.org/jira/browse/MAPREDUCE-6085

Thanks for all who replied.


On 11 сент. 2014 г., at 11:16, Dmitry Sivachenko <tr...@gmail.com> wrote:

> After streaming job outputs some data to stdout, some hadoop code receives it and splits into key/value pair before it reaches TextOutputFormat.
> Can anyone point me to that piece of code please?
> 
> Thanks!
> 
> On 11 сент. 2014 г., at 0:37, Dmitry Sivachenko <tr...@gmail.com> wrote:
> 
>> 
>> On 10 сент. 2014 г., at 22:33, Felix Chern <id...@gmail.com> wrote:
>> 
>>> Use ‘tr -s’ to stripe out tabs?
>>> 
>>> $ echo -e "a\t\t\tb"
>>> a			b
>>> 
>>> $ echo -e "a\t\t\tb" | tr -s "\t"
>>> a	b
>>> 
>> 
>> There can be tabs in the input, I want to keep input lines without any modification.
>> 
>> Actually it is rather standard task: process lines one by one without inserting extra characters.  There should be standard solution for it IMO.
>> 
>

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

Okay, FWIW I found the solution:

https://issues.apache.org/jira/browse/MAPREDUCE-6085

Thanks for all who replied.


On 11 сент. 2014 г., at 11:16, Dmitry Sivachenko <tr...@gmail.com> wrote:

> After streaming job outputs some data to stdout, some hadoop code receives it and splits into key/value pair before it reaches TextOutputFormat.
> Can anyone point me to that piece of code please?
> 
> Thanks!
> 
> On 11 сент. 2014 г., at 0:37, Dmitry Sivachenko <tr...@gmail.com> wrote:
> 
>> 
>> On 10 сент. 2014 г., at 22:33, Felix Chern <id...@gmail.com> wrote:
>> 
>>> Use ‘tr -s’ to stripe out tabs?
>>> 
>>> $ echo -e "a\t\t\tb"
>>> a			b
>>> 
>>> $ echo -e "a\t\t\tb" | tr -s "\t"
>>> a	b
>>> 
>> 
>> There can be tabs in the input, I want to keep input lines without any modification.
>> 
>> Actually it is rather standard task: process lines one by one without inserting extra characters.  There should be standard solution for it IMO.
>> 
>

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

Okay, FWIW I found the solution:

https://issues.apache.org/jira/browse/MAPREDUCE-6085

Thanks for all who replied.


On 11 сент. 2014 г., at 11:16, Dmitry Sivachenko <tr...@gmail.com> wrote:

> After streaming job outputs some data to stdout, some hadoop code receives it and splits into key/value pair before it reaches TextOutputFormat.
> Can anyone point me to that piece of code please?
> 
> Thanks!
> 
> On 11 сент. 2014 г., at 0:37, Dmitry Sivachenko <tr...@gmail.com> wrote:
> 
>> 
>> On 10 сент. 2014 г., at 22:33, Felix Chern <id...@gmail.com> wrote:
>> 
>>> Use ‘tr -s’ to stripe out tabs?
>>> 
>>> $ echo -e "a\t\t\tb"
>>> a			b
>>> 
>>> $ echo -e "a\t\t\tb" | tr -s "\t"
>>> a	b
>>> 
>> 
>> There can be tabs in the input, I want to keep input lines without any modification.
>> 
>> Actually it is rather standard task: process lines one by one without inserting extra characters.  There should be standard solution for it IMO.
>> 
>

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

After a streaming job outputs some data to stdout, some Hadoop code receives it and splits it into a key/value pair before it reaches TextOutputFormat.
Can anyone point me to that piece of code, please?

Thanks!

On 11 Sep 2014, at 0:37, Dmitry Sivachenko <tr...@gmail.com> wrote:

> 
> On 10 Sep 2014, at 22:33, Felix Chern <id...@gmail.com> wrote:
> 
>> Use ‘tr -s’ to strip out tabs?
>> 
>> $ echo -e "a\t\t\tb"
>> a			b
>> 
>> $ echo -e "a\t\t\tb" | tr -s "\t"
>> a	b
>> 
> 
> There can be tabs in the input, and I want to keep the input lines without any modification.
> 
> Actually, it is a rather standard task: process lines one by one without inserting extra characters.  There should be a standard solution for it, IMO.
>

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

On 11 Sep 2014, at 0:47, Felix Chern <id...@gmail.com> wrote:

> If you don’t want anything to get inserted, just set your output to key only or value only.
> TextOutputFormat$LineRecordWriter won’t insert anything unless both values are set:


If I output the value only, for instance, and my line contains a TAB, then everything before the TAB will be lost?
If I output the key only, and my line contains a TAB, then everything after the TAB will be lost?


> 
>     public synchronized void write(K key, V value)
>       throws IOException {
> 
>       boolean nullKey = key == null || key instanceof NullWritable;
>       boolean nullValue = value == null || value instanceof NullWritable;
>       if (nullKey && nullValue) {
>         return;
>       }
>       if (!nullKey) {
>         writeObject(key);
>       }
>       if (!(nullKey || nullValue)) {
>         out.write(keyValueSeparator);
>       }
>       if (!nullValue) {
>         writeObject(value);
>       }
>       out.write(newline);
>     }
> 
> On Sep 10, 2014, at 1:37 PM, Dmitry Sivachenko <tr...@gmail.com> wrote:
> 
>> 
>> On 10 Sep 2014, at 22:33, Felix Chern <id...@gmail.com> wrote:
>> 
>>> Use ‘tr -s’ to strip out tabs?
>>> 
>>> $ echo -e "a\t\t\tb"
>>> a			b
>>> 
>>> $ echo -e "a\t\t\tb" | tr -s "\t"
>>> a	b
>>> 
>> 
>> There can be tabs in the input, and I want to keep the input lines without any modification.
>> 
>> Actually, it is a rather standard task: process lines one by one without inserting extra characters.  There should be a standard solution for it, IMO.
>> 
>

Re: Writing output from streaming task without dealing with key/value

Posted by Felix Chern <id...@gmail.com>.

If you don’t want anything to get inserted, just set your output to key only or value only.
TextOutputFormat$LineRecordWriter won’t insert anything unless both values are set:

    public synchronized void write(K key, V value)
      throws IOException {

      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;
      if (nullKey && nullValue) {
        return;
      }
      if (!nullKey) {
        writeObject(key);
      }
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }
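
For readers who want to experiment with this logic outside Hadoop, the method above can be mirrored in Python roughly as follows. This is a sketch: the NullWritable class and the render helper are stand-ins written for the illustration, not Hadoop APIs, and str() replaces writeObject.

```python
import io


class NullWritable:
    """Stand-in for Hadoop's NullWritable marker type (illustration only)."""


def write(out, key, value, separator="\t", newline="\n"):
    """Mirror of the quoted write(): the separator appears only if both are set."""
    null_key = key is None or isinstance(key, NullWritable)
    null_value = value is None or isinstance(value, NullWritable)
    if null_key and null_value:
        return
    if not null_key:
        out.write(str(key))
    if not (null_key or null_value):
        out.write(separator)
    if not null_value:
        out.write(str(value))
    out.write(newline)


def render(key, value):
    """Hypothetical helper: return what write() would emit for one record."""
    buf = io.StringIO()
    write(buf, key, value)
    return buf.getvalue()


if __name__ == "__main__":
    print(repr(render("a", None)))  # key only
    print(repr(render("a", "")))    # empty but non-null value
```

Note that an empty string is not a null value in this logic, so a record with an empty value still gets the separator; only None or a NullWritable suppresses it.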

On Sep 10, 2014, at 1:37 PM, Dmitry Sivachenko <tr...@gmail.com> wrote:

> 
> On 10 Sep 2014, at 22:33, Felix Chern <id...@gmail.com> wrote:
> 
>> Use ‘tr -s’ to strip out tabs?
>> 
>> $ echo -e "a\t\t\tb"
>> a			b
>> 
>> $ echo -e "a\t\t\tb" | tr -s "\t"
>> a	b
>> 
> 
> There can be tabs in the input, and I want to keep the input lines without any modification.
> 
> Actually, it is a rather standard task: process lines one by one without inserting extra characters.  There should be a standard solution for it, IMO.
>

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

On 10 Sep 2014, at 22:33, Felix Chern <id...@gmail.com> wrote:

> Use ‘tr -s’ to strip out tabs?
> 
>  $ echo -e "a\t\t\tb"
> a			b
> 
>  $ echo -e "a\t\t\tb" | tr -s "\t"
> a	b
> 

There can be tabs in the input, and I want to keep the input lines without any modification.

Actually, it is a rather standard task: process lines one by one without inserting extra characters.  There should be a standard solution for it, IMO.
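
A short Python sketch of why squeezing is not safe for such data. The function squeeze_tabs is a hypothetical stand-in that roughly imitates what 'tr -s' does for TAB, and the sample record is invented for the illustration.

```python
import re


def squeeze_tabs(line):
    # Rough equivalent of squeezing with tr -s: collapse each run of TABs into one.
    return re.sub(r"\t+", "\t", line)


if __name__ == "__main__":
    record = "id\t\tname"  # three TAB-separated fields, the middle one empty
    squeezed = squeeze_tabs(record)
    print(len(record.split("\t")), "fields before,",
          len(squeezed.split("\t")), "fields after")
```

A record with an intentionally empty field loses that field after squeezing, so the output is no longer the unmodified input line.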

Re: Writing output from streaming task without dealing with key/value

Posted by Felix Chern <id...@gmail.com>.

Use ‘tr -s’ to strip out tabs?

 $ echo -e "a\t\t\tb"
a			b

 $ echo -e "a\t\t\tb" | tr -s "\t"
a	b


On Sep 10, 2014, at 11:28 AM, Dmitry Sivachenko <tr...@gmail.com> wrote:

> 
> On 10 Sep 2014, at 22:19, Rich Haase <rd...@gmail.com> wrote:
> 
>> You can write a custom output format
> 
> 
> Any clues how this can be done?
> 
> 
> 
>> , or you can write your mapreduce job in Java and use a NullWritable as Susheel recommended.  
>> 
>> grep (and every other *nix text processing command I can think of) would not be bothered by a trailing tab character.  It's even quite easy to strip away that tab character, if you don't want it, during the post-processing steps you want to perform with *nix commands.
> 
> 
> Problem is that the line itself contains a TAB in the middle, there will not be extra trailing TAB at the end.
> So it is not that simple.
> You never know if it is a TAB from the original line or it is extra TAB added by TextOutputFormat.
> 
> Thanks!

Re: Writing output from streaming task without dealing with key/value

Posted by Shahab Yunus <sh...@gmail.com>.

Examples (the top ones are related to streaming jobs):

http://www.infoq.com/articles/HadoopOutputFormat
http://research.neustar.biz/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/
http://stackoverflow.com/questions/12759651/how-to-override-inputformat-and-outputformat-in-hadoop-application

Regards,
Shahab

On Wed, Sep 10, 2014 at 2:28 PM, Dmitry Sivachenko <tr...@gmail.com>
wrote:

>
> On 10 Sep 2014, at 22:19, Rich Haase <rd...@gmail.com> wrote:
>
> > You can write a custom output format
>
>
> Any clues how this can be done?
>
>
>
> > , or you can write your mapreduce job in Java and use a NullWritable as
> > Susheel recommended.
> >
> > grep (and every other *nix text processing command I can think of) would
> > not be limited by a trailing tab character.  It's even quite easy to strip
> > away that tab character if you don't want it during the post-processing
> > steps you want to perform with *nix commands.
>
>
> The problem is that when the line itself contains a TAB in the middle, no
> extra trailing TAB is added at the end.
> So it is not that simple.
> You never know whether a TAB comes from the original line or is an extra TAB
> added by TextOutputFormat.
>
> Thanks!

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

On 10 Sep 2014, at 22:19, Rich Haase <rd...@gmail.com> wrote:

> You can write a custom output format

Any clues how this can be done?

> , or you can write your mapreduce job in Java and use a NullWritable as Susheel recommended.  
> 
> grep (and every other *nix text processing command I can think of) would not be limited by a trailing tab character.  It's even quite easy to strip away that tab character if you don't want it during the post-processing steps you want to perform with *nix commands. 

The problem is that when the line itself contains a TAB in the middle, no extra trailing TAB is added at the end.
So it is not that simple.
You never know whether a TAB comes from the original line or is an extra TAB added by TextOutputFormat.

Thanks!
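
The ambiguity described above can be sketched with a small Python model. This is a simplified simulation of the behaviour reported in this thread, not Hadoop code: the function names are made up for the example, and it assumes the script's output line is split at the first TAB into key and value and then written back as key, separator, value.

```python
SEP = "\t"  # default separator mentioned in the thread


def split_streaming_output(line, sep=SEP):
    # Everything before the first separator is taken as the key,
    # the rest as the value (empty if there is no separator).
    key, _, value = line.partition(sep)
    return key, value


def write_text_record(key, value, sep=SEP):
    # Model of the writer as described in the thread:
    # key, then the separator, then the value.
    return key + sep + value


def round_trip(line):
    # What a line printed by the script looks like in the output file.
    return write_text_record(*split_streaming_output(line))


for original in ["a\tb", "ab", "ab\t"]:
    print(repr(original), "->", repr(round_trip(original)))
```

In this model the lines "ab" and "ab\t" end up identical in the output, so stripping a trailing TAB afterwards cannot restore the original line reliably.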

Re: Writing output from streaming task without dealing with key/value

Posted by Rich Haase <rd...@gmail.com>.

You can write a custom output format, or you can write your mapreduce job
in Java and use a NullWritable as Susheel recommended.

grep (and every other *nix text processing command I can think of) would
not be limited by a trailing tab character.  It's even quite easy to strip
away that tab character if you don't want it during the post-processing
steps you want to perform with *nix commands.

On Wed, Sep 10, 2014 at 12:12 PM, Dmitry Sivachenko <tr...@gmail.com>
wrote:

>
> On 10 Sep 2014, at 22:05, Rich Haase <rd...@gmail.com> wrote:
>
> > In Python, or any streaming program, just set the output value to the
> > empty string and you will get something like "key"\t"".
> >
>
>
> I see, but I want to use many existing programs (like UNIX grep), and I
> don't want to have an extra "\t" in the output.
>
> Is there any way to achieve this?  Or maybe it is possible to write a
> custom XxxOutputFormat to work around that issue?
>
> (Something opposite to TextInputFormat: it passes the input line without any
> modification to the script's stdin, so there should be a way to write stdout
> to a file "as is".)
>
>
> Thanks!
>
>
> > On Wed, Sep 10, 2014 at 12:03 PM, Susheel Kumar Gadalay <
> > skgadalay@gmail.com> wrote:
> > If you don't want the key in the final output, you can set it like this in Java.
> >
> > job.setOutputKeyClass(NullWritable.class);
> >
> > It will just print the value in the output file.
> >
> > I don't know how to do it in Python.
> >
> > On 9/10/14, Dmitry Sivachenko <tr...@gmail.com> wrote:
> > > Hello!
> > >
> > > Imagine the following common task: I want to process a big text file
> > > line-by-line using the streaming interface.
> > > Run the unix grep command, for instance.  Or some other line-by-line
> > > processing, e.g. line.upper().
> > > I copy the file to HDFS.
> > >
> > > Then I run a map task on this file which reads one line, modifies it in
> > > some way and then writes it to the output.
> > >
> > > TextInputFormat suits reading well: its key is the offset in bytes
> > > (meaningless in my case) and the value is the line itself, so I can
> > > iterate over lines like this (in Python):
> > > for line in sys.stdin:
> > >   print(line.upper())
> > >
> > > The problem arises with TextOutputFormat: it tries to split the
> > > resulting line on mapreduce.output.textoutputformat.separator, which
> > > results in an extra separator in the output if this character is missing
> > > from the line (an extra TAB at the end if we stick to defaults).
> > >
> > > Is there any way to write the result of a streaming task without any
> > > internal processing, so it appears exactly as the script produces it?
> > >
> > > If it is impossible with Hadoop, which works with key/value pairs, maybe
> > > there are other frameworks that work on top of HDFS and allow this?
> > >
> > > Thanks in advance!
> >
> >
> >
> > --
> > Kernighan's Law
> > "Debugging is twice as hard as writing the code in the first place.
> > Therefore, if you write the code as cleverly as possible, you are, by
> > definition, not smart enough to debug it."
>
>


-- 
*Kernighan's Law*
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it."

Re: Writing output from streaming task without dealing with key/value

Posted by Dmitry Sivachenko <tr...@gmail.com>.

On 10 Sep 2014, at 22:05, Rich Haase <rd...@gmail.com> wrote:

> In Python, or any streaming program, just set the output value to the empty string and you will get something like "key"\t"".
> 


I see, but I want to use many existing programs (like UNIX grep), and I don't want to have an extra "\t" in the output.

Is there any way to achieve this?  Or maybe it is possible to write a custom XxxOutputFormat to work around that issue?

(Something opposite to TextInputFormat: it passes the input line without any modification to the script's stdin, so there should be a way to write stdout to a file "as is".)


Thanks!


> On Wed, Sep 10, 2014 at 12:03 PM, Susheel Kumar Gadalay <sk...@gmail.com> wrote:
> If you don't want the key in the final output, you can set it like this in Java.
> 
> job.setOutputKeyClass(NullWritable.class);
> 
> It will just print the value in the output file.
> 
> I don't know how to do it in Python.
> 
> On 9/10/14, Dmitry Sivachenko <tr...@gmail.com> wrote:
> > Hello!
> >
> > Imagine the following common task: I want to process a big text file
> > line-by-line using the streaming interface.
> > Run the unix grep command, for instance.  Or some other line-by-line processing,
> > e.g. line.upper().
> > I copy the file to HDFS.
> >
> > Then I run a map task on this file which reads one line, modifies it in some
> > way and then writes it to the output.
> >
> > TextInputFormat suits reading well: its key is the offset in bytes
> > (meaningless in my case) and the value is the line itself, so I can iterate
> > over lines like this (in Python):
> > for line in sys.stdin:
> >   print(line.upper())
> >
> > The problem arises with TextOutputFormat: it tries to split the resulting
> > line on mapreduce.output.textoutputformat.separator, which results in an extra
> > separator in the output if this character is missing from the line
> > (an extra TAB at the end if we stick to defaults).
> >
> > Is there any way to write the result of a streaming task without any internal
> > processing, so it appears exactly as the script produces it?
> >
> > If it is impossible with Hadoop, which works with key/value pairs, maybe
> > there are other frameworks that work on top of HDFS and allow this?
> >
> > Thanks in advance!
> 
> 
> 
> -- 
> Kernighan's Law
> "Debugging is twice as hard as writing the code in the first place.  Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
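
A side note on the mapper quoted above, as a hedged sketch rather than a fix for the separator question: each line read from sys.stdin already ends with a newline, so print(line.upper()) would emit an extra blank line after every record. The version below writes each transformed line back unchanged in length. The names transform and run_mapper are made up for this example, and the sketch does nothing about the TAB that is reported to be added later by the output format.

```python
import sys


def transform(line):
    # Example per-line processing taken from the thread: upper-case the line.
    return line.upper()


def run_mapper(stdin, stdout):
    # Each line already carries its trailing newline, so write it back
    # with write() instead of print() to avoid adding a second one.
    for line in stdin:
        stdout.write(transform(line))


# In an actual mapper script one would call, at the bottom of the file:
#   run_mapper(sys.stdin, sys.stdout)
```

Passing the streams as parameters keeps the loop easy to try out locally with in-memory files before submitting a job.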

Re: Writing output from streaming task without dealing with key/value

Posted by Rich Haase <rd...@gmail.com>.

In Python, or any streaming program, just set the output value to the empty
string and you will get something like "key"\t"".
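
To make the trailing separator concrete, here is a minimal Python sketch of the behaviour discussed in this thread. It is only a model, not Hadoop code: the names split_streaming_line and format_text_output are made up for illustration, and it assumes that streaming splits each line of script output at the first tab and that the text writer then emits key, separator and value.

```python
SEPARATOR = "\t"  # default separator discussed in this thread


def split_streaming_line(line, sep=SEPARATOR):
    """Model of the streaming split: text before the first separator is the
    key, the rest is the value (empty if the line has no separator)."""
    key, _, value = line.partition(sep)
    return key, value


def format_text_output(key, value, sep=SEPARATOR):
    """Model of a text writer that always joins a non-null key and value."""
    return key + sep + value


if __name__ == "__main__":
    for line in ["HELLO WORLD", "left\tright"]:
        key, value = split_streaming_line(line)
        # repr() makes the trailing tab visible
        print(repr(format_text_output(key, value)))
```

Under these assumptions a line without a tab comes back with one appended, while a line that already contains a tab round-trips unchanged. Note also that "abc" and "abc\t" both split into the pair ("abc", ""), so a writer that only sees the pair cannot tell whether the separator was present in the script's output.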

On Wed, Sep 10, 2014 at 12:03 PM, Susheel Kumar Gadalay <skgadalay@gmail.com> wrote:

> If you don't want key in the final output, you can set like this in Java.
>
> job.setOutputKeyClass(NullWritable.class);
>
> It will just print the value in the output file.
>
> I don't know how to do it in python.
>
> On 9/10/14, Dmitry Sivachenko <tr...@gmail.com> wrote:
> > Hello!
> >
> > Imagine the following common task: I want to process big text file
> > line-by-line using streaming interface.
> > Run unix grep command for instance.  Or some other line-by-line processing,
> > e.g. line.upper().
> > I copy file to HDFS.
> >
> > Then I run a map task on this file which reads one line, modifies it some
> > way and then writes it to the output.
> >
> > TextInputFormat suites well for reading: it's key is the offset in bytes
> > (meaningless in my case) and the value is the line itself, so I can iterate
> > over line like this (in python):
> > for line in sys.stdin:
> >   print(line.upper())
> >
> > The problem arises with TextOutputFormat:  It tries to split the resulting
> > line on mapreduce.output.textoutputformat.separator which results in extra
> > separator in output if this character is missing in the line, for instance
> > (extra TAB at the end if we stick to defaults).
> >
> > Is there any way to write the result of streaming task without any internal
> > processing so it appears exactly as the script produces it?
> >
> > If it is impossible with Hadoop, which works with key/value pairs, may be
> > there are other frameworks which work on top of HDFS which allow to do
> > this?
> >
> > Thanks in advance!
>



-- 
*Kernighan's Law*
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it."

Re: Writing output from streaming task without dealing with key/value

Posted by Susheel Kumar Gadalay <sk...@gmail.com>.

If you don't want the key in the final output, you can set it like this in Java:

job.setOutputKeyClass(NullWritable.class);

It will just print the value in the output file.

I don't know how to do it in Python.
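
As a rough illustration of why a null key drops the separator, here is a small Python model of a text record writer. It is a sketch based on my understanding of how TextOutputFormat treats null keys and values, not the actual Hadoop implementation; write_record and render are hypothetical names, and None stands in for a null or NullWritable.

```python
import io


def write_record(out, key, value, sep="\t"):
    """Model of a text record writer; None stands in for null/NullWritable."""
    if key is None and value is None:
        return  # nothing to write for an entirely empty record
    if key is not None:
        out.write(key)
    if key is not None and value is not None:
        out.write(sep)  # separator only when both parts are present
    if value is not None:
        out.write(value)
    out.write("\n")


def render(key, value):
    """Helper for this example: return what write_record would emit."""
    buf = io.StringIO()
    write_record(buf, key, value)
    return buf.getvalue()


if __name__ == "__main__":
    print(repr(render(None, "hello")))  # null key: value only
    print(repr(render("hello", "")))    # empty-string value: separator kept
    print(repr(render("hello", None)))  # null value: key only
```

In this model an empty string is not the same as a null: an empty value still gets a separator in front of it, which matches the extra tab reported in the original question. Whether a streaming job can actually hand a null key or value to the writer is something to verify against the Hadoop version in use.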

On 9/10/14, Dmitry Sivachenko <tr...@gmail.com> wrote:
> Hello!
>
> Imagine the following common task: I want to process big text file
> line-by-line using streaming interface.
> Run unix grep command for instance.  Or some other line-by-line processing,
> e.g. line.upper().
> I copy file to HDFS.
>
> Then I run a map task on this file which reads one line, modifies it some
> way and then writes it to the output.
>
> TextInputFormat suites well for reading: it's key is the offset in bytes
> (meaningless in my case) and the value is the line itself, so I can iterate
> over line like this (in python):
> for line in sys.stdin:
>   print(line.upper())
>
> The problem arises with TextOutputFormat:  It tries to split the resulting
> line on mapreduce.output.textoutputformat.separator which results in extra
> separator in output if this character is missing in the line, for instance
> (extra TAB at the end if we stick to defaults).
>
> Is there any way to write the result of streaming task without any internal
> processing so it appears exactly as the script produces it?
>
> If it is impossible with Hadoop, which works with key/value pairs, may be
> there are other frameworks which work on top of HDFS which allow to do
> this?
>
> Thanks in advance!
