Posted to user@avro.apache.org by Anna Lahoud <an...@gmail.com> on 2013/08/27 22:32:11 UTC

Mapreduce Strings from reader, when Avro is clearly Utf8

I am experiencing a problem and I found that another user wrote in about
this same issue in March 2013, but there were no replies to his question.
I am really hoping that someone can explain this or offer suggestions. I
cut and pasted his message in below, since I could only find it in an
archive.

I have Avro files that clearly contain Utf8, and if I read them outside of
MapReduce, I get Utf8 out. However, with the same files, I get String
objects back from the mapper. Help!?!?!
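
For reference, a minimal sketch of the kind of non-MapReduce read I mean
(the file name "records.avro" and the field name "Key" are placeholders):

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class PlainRead {
        public static void main(String[] args) throws Exception {
            DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File("records.avro"), new GenericDatumReader<GenericRecord>());
            while (reader.hasNext()) {
                // Prints class org.apache.avro.util.Utf8 for string fields.
                System.out.println(reader.next().get("Key").getClass());
            }
            reader.close();
        }
    }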

Message-ID: <51...@gmail.com>
Date: Fri, 08 Mar 2013 10:31:38 -0800
From: Pierre Mariani <pi...@gmail.com>
To: user@avro.apache.org
Subject: String types in GenericRecord when using mapreduce vs mapred

Depending on the version of the Hadoop API I am using, I get generic
Avro objects that use either Utf8 or java.lang.String to represent Avro
strings...

My existing Hadoop job is defined using the old API (mapred). It works
with Avro files and generic records.

The objects are records. One of their fields is "Key", and its value is
a string.

In my mapper, I print the class of the value of the "Key" field for
debugging purposes:

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.Reporter;

private static class DiffMapper
        extends AvroMapper<GenericRecord, Pair<Utf8, GenericRecord>> {
    @Override
    public void map(GenericRecord record,
                    AvroCollector<Pair<Utf8, GenericRecord>> collector,
                    Reporter reporter) throws IOException {
        // Print the runtime class of the "Key" field for debugging.
        System.out.println(record.get("Key").getClass());
        // ... rest of mapper code ...
    }
}

This prints org.apache.avro.util.Utf8


After I ported my job to the new API (mapreduce, see code below), the
debug code reports that the value is of type String.

import java.io.IOException;

import org.apache.avro.generic.GenericData;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

private static class DiffMapper
        extends Mapper<AvroKey<GenericData.Record>, NullWritable,
                       Text, AvroValue<GenericData.Record>> {
    @Override
    public void map(AvroKey<GenericData.Record> key, NullWritable value,
                    Context context) throws IOException, InterruptedException {
        GenericData.Record record = key.datum();
        // Same debug print as in the old-API mapper.
        System.out.println(record.get("Key").getClass());
        // ... rest of mapper code ...
    }
}

Is there a way to get the first behavior (strings are Utf8) with the
mapreduce API? I am using 1.7.3 from Maven Central.

Thank you

Re: Mapreduce Strings from reader, when Avro is clearly Utf8

Posted by Marshall Bockrath-Vandegrift <ll...@gmail.com>.
Anna Lahoud <an...@gmail.com> writes:

> I am experiencing a problem and I found that another user wrote in
> about this same issue in March 2013, but there were no replies to his
> question. I am really hoping that someone can explain this or offer
> suggestions. I cut and pasted his message in below, since I could only
> find it in an archive.
>
> I have Avro files that clearly contain Utf8, and if I read them outside
> of MapReduce, I get Utf8 out. However, with the same files, I get String
> objects back from the mapper. Help!?!?!

There are some confusing differences between the now-named “data models”
used by the `mapred` vs `mapreduce` APIs.  

The Generic{Data,Datum{Reader,Writer}} and Specific implementations
generate `Utf8` instances by default.  The Reflect implementation appears
to generate only `String` instances.
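
You can see this outside of MapReduce entirely.  A quick sketch (the file
and field names are placeholders), assuming the record's Java class is not
on the classpath, so the Reflect reader falls back to generic records:

    import java.io.File;

    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.reflect.ReflectDatumReader;

    public class ReflectRead {
        public static void main(String[] args) throws Exception {
            DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File("records.avro"), new ReflectDatumReader<GenericRecord>());
            // With the Reflect data model, the same string field now prints
            // class java.lang.String instead of Utf8.
            System.out.println(reader.next().get("Key").getClass());
            reader.close();
        }
    }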

In 1.7.4 and earlier: The `mapred` API defaults to using the Specific
implementations (producing `Utf8`s), but may be configured to use the
Reflect implementations via the `...mapred.AvroJob.setReflect()` method.
The `mapreduce` API uses the Reflect implementations and cannot be
configured – and thus always produces `String` instances.  So no dice.
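
For completeness, the old-API opt-in looks something like this (a sketch
using the `setReflect` method named above):

    import org.apache.avro.mapred.AvroJob;
    import org.apache.hadoop.mapred.JobConf;

    public class OldApiReflect {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Switch the mapred API from its Specific default (Utf8) to the
            // Reflect data model (java.lang.String).
            AvroJob.setReflect(conf);
        }
    }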

In 1.7.5 (and, I hope, later): both APIs allow you to specify the data
model as a subclass of `GenericData`.  For example:

    import org.apache.avro.mapreduce.AvroJob;
    ....
    AvroJob.setDataModelClass(job, GenericData.class);

Setting the job data model this way should yield the `Utf8` instances
you’re hoping for.
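
In a fuller job setup, that would look something like this (the job name,
input format, and schema plumbing are placeholders for whatever your job
already uses):

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class DiffJob {
        static Job configure(Configuration conf, Schema inputSchema) throws IOException {
            Job job = new Job(conf, "diff");
            job.setInputFormatClass(AvroKeyInputFormat.class);
            AvroJob.setInputKeySchema(job, inputSchema);
            // The key line: pin the Generic data model so the mapper sees
            // Utf8 rather than String for Avro string fields.
            AvroJob.setDataModelClass(job, GenericData.class);
            return job;
        }
    }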

HTH,

-Marshall