Posted to common-user@hadoop.apache.org by "Philip (flip) Kromer" <fl...@infochimps.org> on 2008/12/15 13:54:13 UTC
KeyFieldBasedPartitioner fiddling with backslashed values
I'm having a weird issue. When I invoke my MapReduce job with a secondary
sort using the KeyFieldBasedPartitioner, it alters lines containing
backslashes. Or I've made some foolish conceptual error and my script is
doing so, but only when there's a partitioner. Any advice welcome. I've
attached the script and a bowdlerized copy of the output, since I fear the
worst for the formatting of the text below.
With no partitioner, among a few million other lines, my script produces
this one correctly:
=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526
0000018421 M...e http://http:\\www.MyWebsitee.com S, NJ I... notice. Eastern
Time (US & Canada) -18000 20081205-184526
=========
( was called using: )
hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
  -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
  -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
  -input rawd/keyed/_20081205'/user-keyed.tsv' \
  -output out/"parsed-$output_id"
Note that the website field contained
http://http:\\www.MyWebsitee.com
(this person clearly either fails at internet or wins at windows)
When I use a KeyFieldBasedPartitioner, it behaves correctly *except* on
these few lines with backslashes: the doubled backslash comes out instead
as a single backslash followed by a tab:
=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526
0000018421 M...e http://http:\ www.MyWebsitee.com S, NJ I... notice. Eastern
Time (US & Canada) -18000 20081205-184526
=========
( was called using: )
hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -jobconf map.output.key.field.separator='\t' \
  -jobconf num.key.fields.for.partition=1 \
  -jobconf stream.map.output.field.separator='\t' \
  -jobconf stream.num.map.output.key.fields=2 \
  -mapper /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
  -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
  -input rawd/keyed/_20081205'/user-keyed.tsv' \
  -output out/"parsed-$output_id"
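One thing that may be worth ruling out (an assumption on my part, not
something I've confirmed in the Hadoop source): in the shell, '\t' inside
single quotes reaches -jobconf as a literal backslash followed by the
letter t, not as a tab character, so the partitioner may be seeing a
two-character separator string. A quick way to see the difference between
the two quoting forms (POSIX printf, plus bash's $'…' ANSI-C quoting):

```shell
# In single quotes, \t stays as two literal characters: '\' and 't'.
printf '%s' '\t' | wc -c    # 2 bytes
# bash's ANSI-C quoting $'\t' produces the actual tab control character.
printf '%s' $'\t' | wc -c   # 1 byte
# od -c shows exactly which bytes each form contains.
printf '%s' '\t' | od -c
printf '%s' $'\t' | od -c
```

If Hadoop does not itself unescape the '\t', passing a real tab (or
checking what the job sees in its config) might behave differently.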
When I run the scripts on the command line,
  cat input | hadoop_parse_json.rb | sort -k1,2 | hadoop_uniq_without_timestamp.rb
everything works as I'd like.
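For completeness, here is a minimal, self-contained version of that local
check, using a made-up record (the key fields and URL are placeholders,
not my real data); the doubled backslash survives the sort step intact:

```shell
# Hypothetical tab-separated record with a doubled backslash in the
# URL field, mimicking the problem line above.
line=$(printf 'twitter_user_profile\tkey-0001\thttp://http:\\\\www.example.com')

# Push it through the same local sort step used in the pipeline.
out=$(printf '%s\n' "$line" | sort -k1,2)

# The line comes back byte-for-byte identical.
[ "$out" = "$line" ] && echo "backslashes preserved"
```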
I've hunted through the JIRA and found nothing. If this sounds like a
problem with Hadoop, I'll try to isolate a proper test case.
Thanks for any advice,
flip