Posted to common-user@hadoop.apache.org by "Philip (flip) Kromer" <fl...@infochimps.org> on 2008/12/15 13:54:13 UTC

KeyFieldBasedPartitioner fiddling with backslashed values

I'm having a weird issue.

When I invoke my MapReduce job with a secondary sort using
the KeyFieldBasedPartitioner, it alters lines containing backslashes.
Or I've made some foolish conceptual error and my script is doing so, but
only when there's a partitioner.  Any advice welcome.  I've attached the
script and a bowdlerized copy of the output, since I fear the worst for the
formatting of the text below.

With no partitioner, among a few million other lines, my script
produces this one correctly:

=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526 0000018421 M...e http://http:\\www.MyWebsitee.com S, NJ I... notice. Eastern Time (US & Canada) -18000 20081205-184526
=========


( it was called using: )


hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -mapper  /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input   rawd/keyed/_20081205'/user-keyed.tsv' \
    -output  out/"parsed-$output_id"


Note that the website field contained
  http://http:\\www.MyWebsitee.com
(this person clearly either fails at internet or wins at windows)

When I use a KeyFieldBasedPartitioner, it behaves correctly *except* on
these few lines with backslashes, where it instead generates a single
backslash followed by a tab:


=========
twitter_user_profile twitter_user_profile-0000018421-20081205-184526 0000018421 M...e http://http:\ www.MyWebsitee.com S, NJ I... notice. Eastern Time (US & Canada) -18000 20081205-184526
=========


( it was called using: )

hadoop jar /home/flip/hadoop/h/contrib/streaming/hadoop-*-streaming.jar \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf     map.output.key.field.separator='\t' \
    -jobconf     num.key.fields.for.partition=1 \
    -jobconf     stream.map.output.field.separator='\t' \
    -jobconf     stream.num.map.output.key.fields=2 \
    -mapper      /home/flip/ics/pool/social/network/twitter_friends/hadoop_parse_json.rb \
    -reducer     /home/flip/ics/pool/social/network/twitter_friends/hadoop_uniq_without_timestamp.rb \
    -input       rawd/keyed/_20081205'/user-keyed.tsv' \
    -output      out/"parsed-$output_id"
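One detail worth ruling out first (I don't know whether the streaming jar unescapes separator values itself, so this is only a sanity check, not a diagnosis): inside single quotes the shell passes '\t' through as two literal characters, a backslash followed by the letter t, not as one tab character.

```shell
# What does the jar actually receive for the separator jobconfs?
# Single quotes do no escape processing, so '\t' is two characters long.
sep='\t'
printf 'separator length: %s\n' "${#sep}"   # prints: separator length: 2
printf '%s' "$sep" | od -An -c              # shows a backslash byte then 't'
```

If the partitioner treats that two-character string literally, the separator itself contains a backslash, which might interact badly with backslash-bearing field values.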


When I run the script on the command line

  cat input | hadoop_parse_json.rb | sort -k1,2 | hadoop_uniq_without_timestamp.rb

everything works as I'd like.

I've hunted through JIRA and found nothing.
If this sounds like a problem with Hadoop, I'll try to isolate a proper test
case.
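In the meantime, here's a rough sketch of how I'd pull the affected records out of the job output for a test case. It assumes the corruption always looks like the example above (a lone backslash immediately followed by a tab); the sample data is made up, and in practice you'd File.foreach over the part-* files instead.

```ruby
# Sketch: select lines showing the corruption pattern reported above --
# a backslash immediately followed by a tab. Sample lines are invented
# stand-ins for real job output.
def corrupted_lines(lines)
  lines.select { |line| line.include?("\\\t") }
end

sample = [
  "0000018421\thttp:\\\\www.MyWebsitee.com\tok",  # intact: double backslash survived
  "0000018421\thttp:\\\twww.MyWebsitee.com\tok"   # mangled: backslash then tab
]
puts corrupted_lines(sample)
```

Running that over the with-partitioner and without-partitioner outputs and diffing the two sets should give a minimal reproduction to attach to a JIRA.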

Thanks for any advice,
flip