Posted to hdfs-user@hadoop.apache.org by Aleksandr Elbakyan <ra...@yahoo.com> on 2014/04/29 21:56:13 UTC
Issue with partitioning of data using hadoop streaming
Hello,
I am having an issue with partitioning data between mappers and reducers when the key is numeric. When I switch the key to a one-character string it works fine, but I have more than 26 keys, so I am looking for an alternative.
My data looks like:
10 \t comment10 \t data
20 \t comment20 \t data
30 \t comment30 \t data
40 \t comment40 \t data
up to 250.
The data is around 50 million lines.
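For keys like these, the difference between text and numeric comparison matters: compared as strings, "100" sorts before "25". A quick illustration with plain sort(1), outside Hadoop:

```shell
# Lexicographic (text) comparison misorders numeric keys:
printf '9\n10\n100\n25\n' | sort
# -> 10 100 25 9

# Numeric comparison (what the -k1,1n option asks for):
printf '9\n10\n100\n25\n' | sort -n
# -> 9 10 25 100
```

This is the same distinction the streaming job's key comparator and partitioner options are trying to express.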
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.2+228-streaming.jar \
-D mapred.task.timeout=3600000 \
-D mapred.map.tasks=25 \
-D stream.non.zero.exit.is.failure=true \
-D mapred.reduce.tasks=25 \
-D mapred.output.compress=true \
-D mapred.text.key.partitioner.options=-k1,1n \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input "input" \
-output "output" \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-jobconf stream.map.output.field.separator=. \
-jobconf stream.num.map.output.key.fields=1 \
-jobconf map.output.key.field.separator=\t \
-jobconf num.key.fields.for.partition=1 \
-mapper " cat " \
-reducer " cat "
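One thing stands out in the command above: stream.map.output.field.separator is set to "." while the data is tab-separated, so the framework may treat the entire line as the key instead of the first field. For comparison, here is an untested sketch of the same job with all separator settings made consistently tab, using the 0.20-era option names; note that $'\t' is bash-specific, and whether KeyFieldBasedPartitioner honors the "n" flag is version-dependent (hashing the first field's bytes already sends equal keys to the same reducer either way):

```shell
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.20.2+228-streaming.jar \
  -D mapred.task.timeout=3600000 \
  -D mapred.reduce.tasks=25 \
  -D stream.map.output.field.separator=$'\t' \
  -D stream.num.map.output.key.fields=1 \
  -D map.output.key.field.separator=$'\t' \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input "input" \
  -output "output" \
  -mapper cat \
  -reducer cat
```

Since tab is already the default separator for streaming, the two separator lines could also simply be dropped.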
The other issue I have is with stream.map.output.field.separator: when I set it to a tab, it adds a space to my data when the keys are greater than or equal to 100.
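One likely cause, assuming the command is typed into bash: an unquoted \t on the command line reaches the program as the plain letter t, not a tab, so the separator setting silently fails to match the data. A literal tab can be passed with bash's ANSI-C quoting:

```shell
# Unquoted \t: the backslash merely escapes 't', so the value is the letter t.
sep=\t
printf '%s\n' "$sep"             # prints: t

# ANSI-C quoting ($'...') yields a real tab character:
tab=$'\t'
printf '%s' "$tab" | od -An -c   # od renders the byte as: \t
```

So `-jobconf map.output.key.field.separator=\t` most likely sets the separator to "t", while `map.output.key.field.separator=$'\t'` sets it to an actual tab.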
Any suggestions on how to fix this?
Re: Issue with partitioning of data using hadoop streaming
Posted by Aleksandr Elbakyan <ra...@yahoo.com>.
Any suggestions?