Posted to common-user@hadoop.apache.org by "W.P. McNeill" <bi...@gmail.com> on 2011/09/09 18:56:00 UTC

What is the best way to implement a key that is an array of strings?

I have a data structure that is a variable-length array of strings. Call it
a StringList. I am using StringLists as Hadoop keys. These objects sort
lexicographically (e.g. ["apple", "banana"] < ["apple", "banana", "pear"] <
["apple", "pear"] < ["zucchini"]) and are equal if and only if they have the
same length and all corresponding elements are equal. What is the best way
to implement this object for Hadoop?

Currently I have implemented StringList as a class that extends
ArrayWritable and sets the value class to Text. The compareTo method just
compares string representations of the StringList objects, since those
representations have the ordering property I want. This works, but I'm
uncertain how it will perform at scale.
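For concreteness, here is a simplified sketch of the current approach
(illustrative, not my exact code; the '\0' join is one way to get the
ordering I described, and it assumes the elements never contain '\0'):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class StringList extends ArrayWritable
        implements WritableComparable<StringList> {

    // Hadoop needs a no-arg constructor to deserialize keys.
    public StringList() {
        super(Text.class);
    }

    public StringList(String... values) {
        super(Text.class);
        Text[] texts = new Text[values.length];
        for (int i = 0; i < values.length; i++) {
            texts[i] = new Text(values[i]);
        }
        set(texts);
    }

    // Join elements with '\0', which sorts below every other character,
    // so comparing the joined strings matches comparing the arrays
    // element by element, with a prefix sorting first.
    private String joined() {
        StringBuilder sb = new StringBuilder();
        Writable[] elements = get();
        for (int i = 0; i < elements.length; i++) {
            if (i > 0) {
                sb.append('\0');
            }
            sb.append(elements[i].toString());
        }
        return sb.toString();
    }

    @Override
    public int compareTo(StringList other) {
        return joined().compareTo(other.joined());
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof StringList && compareTo((StringList) o) == 0;
    }

    @Override
    public int hashCode() {
        return joined().hashCode();
    }
}

With this, new StringList("apple", "banana") sorts before
new StringList("apple", "banana", "pear"), which sorts before
new StringList("apple", "pear"), as described above.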

To get the best performance, would I still have to write a raw comparator
for this object, or does ArrayWritable provide one for me?
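If I do have to write one, I imagine it would look roughly like the sketch
below. It assumes StringList keeps ArrayWritable's default serialization (a
four-byte element count, then each Text element as a vint byte length
followed by UTF-8 bytes); the class name is made up:

import java.io.IOException;

import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class StringListRawComparator extends WritableComparator {

    public StringListRawComparator() {
        super(StringList.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // Element counts written by ArrayWritable.write().
            int n1 = readInt(b1, s1);
            int n2 = readInt(b2, s2);
            int p1 = s1 + 4;
            int p2 = s2 + 4;
            for (int i = 0; i < Math.min(n1, n2); i++) {
                // Each Text element is a vint length plus UTF-8 bytes.
                int len1 = readVInt(b1, p1);
                int len2 = readVInt(b2, p2);
                p1 += WritableUtils.decodeVIntSize(b1[p1]);
                p2 += WritableUtils.decodeVIntSize(b2[p2]);
                // Comparing the raw UTF-8 bytes orders elements the same
                // way Text's own raw comparator does.
                int cmp = compareBytes(b1, p1, len1, b2, p2, len2);
                if (cmp != 0) {
                    return cmp;
                }
                p1 += len1;
                p2 += len2;
            }
            // A list that is a prefix of a longer list sorts first.
            return n1 == n2 ? 0 : (n1 < n2 ? -1 : 1);
        } catch (IOException e) {
            throw new IllegalArgumentException("malformed StringList", e);
        }
    }

    static {
        // Register the comparator; this usually goes in a static
        // initializer of the key class so it runs when that class loads.
        WritableComparator.define(StringList.class,
                new StringListRawComparator());
    }
}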

In lieu of writing a raw comparator, should I just implement StringList as
an Avro object? I think Avro gives you raw comparators for free, but I
haven't dug into this.
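From a quick skim of the Avro docs, I think the comparison would look
something like this (an untested sketch based on my reading of the Avro
API; an array-of-strings schema compares element by element, with a
shorter prefix sorting first):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryData;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroStringListDemo {

    // Schema for {"type": "array", "items": "string"}.
    private static final Schema SCHEMA =
            Schema.createArray(Schema.create(Schema.Type.STRING));

    // Serialize a list of strings using Avro's binary encoding.
    static byte[] encode(List<String> values) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<List<String>>(SCHEMA).write(values, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] a = encode(Arrays.asList("apple", "banana"));
        byte[] b = encode(Arrays.asList("apple", "pear"));
        // BinaryData.compare() orders the encoded bytes directly,
        // without deserializing: this is the "free" raw comparison.
        System.out.println(BinaryData.compare(a, 0, b, 0, SCHEMA) < 0); // true
    }
}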