Posted to hdfs-user@hadoop.apache.org by Aleksandr Elbakyan <ra...@yahoo.com> on 2014/04/29 21:56:13 UTC
Issue with partitioning of data using hadoop streaming
Hello,
I am having an issue with partitioning data between mappers and reducers when the key is numeric. When I switch the key to a one-character string it works fine, but I have more than 26 keys, so I am looking for an alternative.
My data looks like:
10 \t comment10 \t data
20 \t comment20 \t data
30 \t comment30 \t data
40 \t comment40 \t data
up to 250.
The data is around 50 million lines.
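For keys like these, the difference between text and numeric comparison matters: compared as strings, "100" sorts before "25". A quick illustration with plain sort(1), outside Hadoop:

```shell
# Lexicographic (text) comparison misorders numeric keys:
printf '9\n10\n100\n25\n' | sort
# -> 10 100 25 9

# Numeric comparison (what the -k1,1n option asks for):
printf '9\n10\n100\n25\n' | sort -n
# -> 9 10 25 100
```

This is the same distinction the streaming job's key comparator and partitioner options are trying to express.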
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.2+228-streaming.jar \
-D mapred.task.timeout=3600000 \
-D mapred.map.tasks=25 \
-D stream.non.zero.exit.is.failure=true \
-D mapred.reduce.tasks=25 \
-D mapred.output.compress=true \
-D mapred.text.key.partitioner.options=-k1,1n \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input "input" \
-output "output" \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-jobconf stream.map.output.field.separator=. \
-jobconf stream.num.map.output.key.fields=1 \
-jobconf map.output.key.field.separator=\t \
-jobconf num.key.fields.for.partition=1 \
-mapper " cat " \
-reducer " cat "
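One thing stands out in the command above: stream.map.output.field.separator is set to "." while the data is tab-separated, so the framework may treat the entire line as the key instead of the first field. For comparison, here is an untested sketch of the same job with all separator settings made consistently tab, using the 0.20-era option names; note that $'\t' is bash-specific, and whether KeyFieldBasedPartitioner honors the "n" flag is version-dependent (hashing the first field's bytes already sends equal keys to the same reducer either way):

```shell
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.2+228-streaming.jar \
  -D mapred.task.timeout=3600000 \
  -D mapred.reduce.tasks=25 \
  -D stream.map.output.field.separator=$'\t' \
  -D stream.num.map.output.key.fields=1 \
  -D map.output.key.field.separator=$'\t' \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input "input" \
  -output "output" \
  -mapper cat \
  -reducer cat
```

Since tab is already the default separator for streaming, the two separator lines could also simply be dropped.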
The other issue I have is with stream.map.output.field.separator: when I set it to a tab, it adds a space to my data when the keys are greater than or equal to 100.
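One likely cause, assuming the command is typed into bash: an unquoted \t on the command line reaches the program as the plain letter t, not a tab, so the separator setting silently fails to match the data. A literal tab can be passed with bash's ANSI-C quoting:

```shell
# Unquoted \t: the backslash merely escapes 't', so the value is the letter t.
sep=\t
printf '%s\n' "$sep"             # prints: t

# ANSI-C quoting ($'...') yields a real tab character:
tab=$'\t'
printf '%s' "$tab" | od -An -c   # od renders the byte as: \t
```

So `-jobconf map.output.key.field.separator=\t` most likely sets the separator to "t", while `map.output.key.field.separator=$'\t'` sets it to an actual tab.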
Any suggestions on how to fix this?
Re: Issue with partitioning of data using hadoop streaming
Posted by Aleksandr Elbakyan <ra...@yahoo.com>.
Any suggestions?