Posted to common-user@hadoop.apache.org by anvesh ragi <an...@gmail.com> on 2015/06/10 07:28:28 UTC

hadoop 2.4.0 streaming generic parser options using TAB as separator

Hello all,

I know that tab is the default value for the following separator properties:

stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
mapreduce.textoutputformat.separator

But if I try to set the generic option:

stream.map.output.field.separator=\t (or)
stream.map.output.field.separator="\t"

to test how Hadoop parses whitespace characters such as "\t" and "\n" when
they are used as separators, I observe that Hadoop reads the value as the
literal characters \t rather than as an actual tab character. I checked this
by printing each line as the reducer (Python) reads it, using:

sys.stdout.write(str(line))

My mapper emits key/value pairs as: key value1 value2

using the call print(key, value1, value2, sep='\t', end='\n').
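
A minimal sketch of a mapper emitting records this way, assuming Python 3.
The function name emit and its out parameter are hypothetical and are only
there to make the snippet testable; the print call is the one quoted above.

```python
import io
import sys


def emit(key, value1, value2, out=sys.stdout):
    """Write one record as three tab-separated fields ending in a newline."""
    print(key, value1, value2, sep="\t", end="\n", file=out)


if __name__ == "__main__":
    buf = io.StringIO()
    emit("key", "value1", "value2", out=buf)
    # repr() shows the tab characters explicitly: 'key\tvalue1\tvalue2\n'
    print(repr(buf.getvalue()))
```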

So I expected my reducer to read each line as "key value1 value2" as well,
but instead sys.stdout.write(str(line)) printed:

key value1 value2   (with a trailing space)
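
Trailing whitespace is hard to see when a line is written out as-is. As a
rough sketch (not the reducer from this post), echoing each line through
repr() makes tabs and trailing characters explicit. The name show_lines and
the sample input are assumptions for illustration.

```python
import io
import sys


def show_lines(stream, out=sys.stdout):
    """Echo each input line via repr() so tabs and trailing whitespace are visible."""
    for line in stream:
        out.write(repr(line) + "\n")


if __name__ == "__main__":
    # Hypothetical sample: a record that ends with an extra tab.
    sample = io.StringIO("key\tvalue1\tvalue2\t\n")
    show_lines(sample)  # writes 'key\tvalue1\tvalue2\t\n'
```

In a real streaming reducer, sys.stdin would be passed in place of the sample.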

From "Hadoop streaming - remove trailing tab from reducer output"
<http://stackoverflow.com/questions/18133290/hadoop-streaming-remove-trailing-tab-from-reducer-output>,
I understood that the trailing space is caused by
mapreduce.textoutputformat.separator being left at its default value.

So this confirmed my assumption that Hadoop treated my entire map output:

key value1 value2

as the key, with an empty Text object as the value, because it read the
separator from stream.map.output.field.separator=\t as the literal
characters "\t" instead of an actual tab character.
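
The assumption above can be modeled in a few lines of Python. This is only a
sketch of the hypothesis, not Hadoop's actual implementation: the helper
split_key_value is hypothetical and simply splits a line at the first
occurrence of the separator, treating the whole line as the key when the
separator is not found.

```python
def split_key_value(line, separator):
    """Split at the first occurrence of separator; no match means the whole line is the key."""
    key, _found, value = line.partition(separator)
    return key, value


record = "key\tvalue1\tvalue2"
real_tab = "\t"   # one character: an actual tab
literal = "\\t"   # two characters: a backslash followed by 't'

print(split_key_value(record, real_tab))  # ('key', 'value1\tvalue2')
print(split_key_value(record, literal))   # ('key\tvalue1\tvalue2', '')

# Writing key and empty value back out joined by a tab leaves a trailing tab.
key, value = split_key_value(record, literal)
print(repr(key + "\t" + value))           # 'key\tvalue1\tvalue2\t'
```

Under this model, a two-character separator never matches a record that only
contains real tabs, which would be consistent with the trailing character
observed in the reducer.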

Please help me understand this behavior. How can I use a tab (\t) as the
separator if I want to?

Thanks & Regards,
Anvesh R

Re: hadoop 2.4.0 streaming generic parser options using TAB as separator

Posted by Kiran Dangeti <ki...@gmail.com>.

\bbb
