Posted to dev@avro.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2010/03/30 23:44:28 UTC

[jira] Created: (AVRO-493) hadoop mapreduce support for avro data

hadoop mapreduce support for avro data
--------------------------------------

                 Key: AVRO-493
                 URL: https://issues.apache.org/jira/browse/AVRO-493
             Project: Avro
          Issue Type: New Feature
          Components: java
            Reporter: Doug Cutting
            Assignee: Doug Cutting


Avro should provide support for using Hadoop MapReduce over Avro data files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (AVRO-493) hadoop mapreduce support for avro data

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851707#action_12851707 ] 

Scott Carey commented on AVRO-493:
----------------------------------

AvroWrapper.java:
{code}
  public boolean equals(Object o) {
    return (datum == null) ? o == null : datum.equals(o);
  }
{code}

The above looks odd.  Is equals() expected to compare the AvroWrapper to another AvroWrapper, or only to the datum?  If the former, then o needs to be cast to AvroWrapper and o.datum used.  If not, then equals() does not adhere to the equals() contract and should be well documented.  (a.equals(b) does not always give the same result as b.equals(a).)
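To make the asymmetry concrete, here is a minimal, self-contained sketch; the BuggyWrapper class below is a stand-in for illustration, not the actual Avro code:

```java
// Demonstrates how comparing the wrapper's datum directly against o
// breaks the symmetry requirement of the equals() contract.
class BuggyWrapper {
  private final Object datum;

  BuggyWrapper(Object datum) { this.datum = datum; }

  // The questionable implementation: compares the *datum* to o,
  // not wrapper to wrapper.
  @Override
  public boolean equals(Object o) {
    return (datum == null) ? o == null : datum.equals(o);
  }

  public static void main(String[] args) {
    BuggyWrapper w = new BuggyWrapper("hello");
    System.out.println(w.equals("hello"));   // true: datum.equals("hello")
    System.out.println("hello".equals(w));   // false: a String never equals a BuggyWrapper
    // Two wrappers holding equal datums are NOT equal to each other:
    System.out.println(w.equals(new BuggyWrapper("hello"))); // false
  }
}
```

The last line also shows why hash-partitioning on such wrappers would misbehave: equal datums do not yield equal wrappers.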

Eclipse's auto-generate equals() tool generates this code for equals():
{code}
  public boolean equals(Object obj) {
    if (this == obj)
      return true;
    if (obj == null)
      return false;
    if (getClass() != obj.getClass())
      return false;
    AvroWrapper other = (AvroWrapper) obj;
    if (datum == null) {
      if (other.datum != null)
        return false;
    } else if (!datum.equals(other.datum))
      return false;
    return true;
  }
{code}

AvroOutputFormat:
It would be nice if the compression level were configurable.  If not, I lean towards level 1, since it is closest to LZO in speed and still compresses well.  But I suppose intermediate outputs and final outputs have different requirements here.

AvroMapper:
This creates a new AvroWrapper for each output.collect().  Is this necessary?  AvroReducer does not do this, but it doesn't have the combiner behind it.  If it must be a new instance, a comment might be appropriate to avoid an improper optimization attempt later.
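A fresh wrapper per collect() matters whenever the collector buffers references rather than serializing immediately, as a combiner's in-memory buffer might.  A minimal sketch of the hazard, using a plain list as a stand-in for a buffering collector (not the actual Hadoop code path):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a buffering collector: it keeps references to the
// objects passed in instead of copying or serializing them.
class ReuseHazardDemo {
  static class Wrapper { Object datum; Wrapper(Object d) { datum = d; } }

  public static void main(String[] args) {
    List<Wrapper> buffer = new ArrayList<>();

    // Reusing one mutable wrapper: every buffered entry ends up
    // pointing at the same object, so earlier outputs are clobbered.
    Wrapper reused = new Wrapper("first");
    buffer.add(reused);
    reused.datum = "second";
    buffer.add(reused);
    System.out.println(buffer.get(0).datum); // "second" -- first output lost

    // Fresh wrapper per collect(): each buffered entry is independent.
    buffer.clear();
    buffer.add(new Wrapper("first"));
    buffer.add(new Wrapper("second"));
    System.out.println(buffer.get(0).datum); // "first" -- preserved
  }
}
```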

AvroKeySerialization:
I am a bit confused about this class.  It is only referenced in the configuration of AvroJob.
When is this used in the big picture?  It appears that the Serialization interface contract requires no buffering (a performance issue if heavily used, for all serializations).  The decoder here doesn't have access to the avro file and can't use the file format's reader or writer -- so what is it for?  The javadoc on the Hadoop Serialization API doesn't have any information on the purpose of the interfaces.   Is it for serializing data to/from SequenceFiles and other such use cases?

General:
Deprecated APIs are used -- are the replacements not appropriate or insufficient?  I'm no expert on the details of the new API.



[jira] Commented: (AVRO-493) hadoop mapreduce support for avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852099#action_12852099 ] 

Doug Cutting commented on AVRO-493:
-----------------------------------

> In the short term we could run our unit tests with 1.0.1 [ ...]

That sounds reasonable, but should be a separate issue, no?

Did my answers & patch modifications otherwise satisfy you?




Re: [jira] Updated: (AVRO-493) hadoop mapreduce support for avro data

Posted by Doug Cutting <cu...@apache.org>.
Scott Carey wrote:
> That's too bad that the intermediate files can't use the Avro file format; performance will suffer until that API changes to either allow custom file formats or to support a feature like the decoder's inputStream() method, allowing buffering of chained or interleaved readers.

The intermediate files are part of the mapreduce kernel.  The buffering, 
sorting, transmission and merging of this data is a critical part of 
mapreduce.  So I don't think it is as simple as just permitting a 
pluggable file format.

> FYI, Avro does not work with Hadoop 0.20 for CDH2 or CDH3 (I have not tried plain 0.20) because they include jackson 1.0.1 and you'll get an exception like this:

Can't one update the version of Jackson in one's Hadoop cluster to fix 
this?  However, that might not work with Amazon's Elastic MapReduce, 
where you don't get to update the cluster (which runs Hadoop 0.18).

Should we avoid using org.codehaus.jackson.JsonFactory.enable() to make 
Avro compatible with older versions of Jackson?

Doug

Re: [jira] Updated: (AVRO-493) hadoop mapreduce support for avro data

Posted by Scott Carey <sc...@richrelevance.com>.
On Mar 31, 2010, at 10:31 AM, Doug Cutting (JIRA) wrote:
>> AvroKeySerialization: I am a bit confused about this class.
> 
> It's used to serialize map outputs and deserialize reduce inputs.  The mapreduce framework uses the job's specified map output key class to find the serialization implementation it uses to read and write intermediate keys and values.
> 
That's too bad that the intermediate files can't use the Avro file format; performance will suffer until that API changes to either allow custom file formats or to support a feature like the decoder's inputStream() method, allowing buffering of chained or interleaved readers.

>> Deprecated APIs are used - are the replacements not appropriate or insufficient?
> 
> Good question.  Hadoop 0.20 deprecated the "old" org.apache.hadoop.mapred APIs to encourage folks to try the new org.apache.hadoop.mapreduce APIs.  However the org.apache.hadoop.mapreduce APIs are not fully functional in 0.20, and folks primarily continue to use the org.apache.hadoop.mapred APIs.  0.20 is used here since it's in Maven repos, but this code should also work against 0.19 and perhaps even 0.18, and I'd compile against one of those instead if it were in a Maven repo.

FYI, Avro does not work with Hadoop 0.20 for CDH2 or CDH3 (I have not tried plain 0.20) because they include jackson 1.0.1 and you'll get an exception like this:

2010-03-31 11:00:55,616 FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.NoSuchMethodError: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
	at org.apache.avro.Schema.<clinit>(Schema.java:81)
	at com.rr.avro.ViewEvent.<clinit>(ViewEvent.java:5)
	at com.rr.eventdata.ViewRecord.<init>(ViewRecord.java:60)
	at com.rr.eventdata.AvroSerializable.<clinit>(AvroSerializable.java:17)
	at com.rr.eventdata.AvroFileReader.createClickDatumReader(AvroFileReader.java:50)

because mappers/reducers don't run in their own classloader space, the contents of Hadoop's default lib directory take class-loading priority.
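The delegation behavior behind this can be sketched in a few lines; this is an illustration of the JDK's parent-first ClassLoader contract, not Hadoop code:

```java
// Sketch of why the job's newer jackson jar loses: the default
// ClassLoader delegates to its parent first, so a class already on the
// system classpath (like Hadoop's lib/ jars) is served before any child
// classloader (the job's jars) is even consulted.
class ParentFirstDemo extends ClassLoader {
  boolean askedLocally = false;

  @Override
  protected Class<?> findClass(String name) throws ClassNotFoundException {
    askedLocally = true;              // only reached if the parent failed
    throw new ClassNotFoundException(name);
  }

  public static void main(String[] args) throws Exception {
    ParentFirstDemo child = new ParentFirstDemo();
    // The parent (system classloader) resolves String; findClass is
    // never consulted -- just as Hadoop's lib/ copy of a class wins
    // over the copy shipped with the job.
    Class<?> c = child.loadClass("java.lang.String");
    System.out.println(c == String.class);  // true
    System.out.println(child.askedLocally); // false
  }
}
```

A "child-first" classloader would reverse this lookup order, which is one form of the partitioning discussed later in this thread.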





[jira] Updated: (AVRO-493) hadoop mapreduce support for avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-493:
------------------------------

    Attachment: AVRO-493.patch

Scott, thanks for the careful review!

> The above looks odd.

Yes, you're right, it was a buggy equals implementation.  I replaced it with the one you provided.  hashCode() is required to support hash-based MapReduce partitioning (the default) and I only provide an equals implementation to be consistent: it's not otherwise required here.  Good catch.

> It would be nice if the compression level was configurable.

Yes, I meant to get to that but forgot.  I've now added it.  Thanks.
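For reference, the usual shape of such a knob can be sketched as follows, using java.util.Properties as a stand-in for Hadoop's JobConf and a hypothetical key name avro.output.deflate.level; the committed patch may use a different key and API:

```java
import java.util.Properties;
import java.util.zip.Deflater;

// Pattern sketch: read an optional deflate compression level from the
// job configuration, falling back to the library default when unset.
class DeflateLevelConfig {
  // Hypothetical key name, for illustration only.
  static final String DEFLATE_LEVEL_KEY = "avro.output.deflate.level";
  static final int DEFAULT_LEVEL = Deflater.DEFAULT_COMPRESSION; // -1

  static int getDeflateLevel(Properties conf) {
    String v = conf.getProperty(DEFLATE_LEVEL_KEY);
    return (v == null) ? DEFAULT_LEVEL : Integer.parseInt(v);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(getDeflateLevel(conf)); // -1 (library default)
    conf.setProperty(DEFLATE_LEVEL_KEY, "1");  // fast, LZO-like setting
    System.out.println(getDeflateLevel(conf)); // 1
  }
}
```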

> This creates a new AvroWrapper for each output.collect().

Oops.  I originally wrote it that way, but reverted it while debugging to remove a possibility but forgot to restore it.  I've now restored it.

> AvroKeySerialization: I am a bit confused about this class.

It's used to serialize map outputs and deserialize reduce inputs.  The mapreduce framework uses the job's specified map output key class to find the serialization implementation it uses to read and write intermediate keys and values.
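The lookup can be pictured like this: the framework walks the configured Serialization implementations and asks each whether it accepts the declared map output key/value class; the first that accepts supplies the serializer/deserializer for intermediate data.  A simplified, self-contained mock of that lookup (these are not the real Hadoop interfaces):

```java
import java.util.Arrays;

// Simplified mock of Hadoop's SerializationFactory behavior: the first
// registered Serialization that accepts the class wins.
class SerializationLookupDemo {
  interface Serialization {
    boolean accept(Class<?> c);
    String name();
  }

  // Stand-in for the wrapper type an Avro serialization would accept.
  static class AvroWrapper {}

  static final Serialization WRITABLE = new Serialization() {
    public boolean accept(Class<?> c) { return false; } // pretend: Writables only
    public String name() { return "writable"; }
  };
  static final Serialization AVRO = new Serialization() {
    public boolean accept(Class<?> c) { return AvroWrapper.class.isAssignableFrom(c); }
    public String name() { return "avro"; }
  };

  // The job's declared map output key class drives the lookup.
  static String lookup(Class<?> mapOutputKeyClass) {
    for (Serialization s : Arrays.asList(WRITABLE, AVRO)) {
      if (s.accept(mapOutputKeyClass)) return s.name();
    }
    return "none";
  }

  public static void main(String[] args) {
    System.out.println(lookup(AvroWrapper.class)); // avro
    System.out.println(lookup(String.class));      // none
  }
}
```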

> Deprecated APIs are used - are the replacements not appropriate or insufficient?

Good question.  Hadoop 0.20 deprecated the "old" org.apache.hadoop.mapred APIs to encourage folks to try the new org.apache.hadoop.mapreduce APIs.  However the org.apache.hadoop.mapreduce APIs are not fully functional in 0.20, and folks primarily continue to use the org.apache.hadoop.mapred APIs.  0.20 is used here since it's in Maven repos, but this code should also work against 0.19 and perhaps even 0.18, and I'd compile against one of those instead if it were in a Maven repo.




[jira] Resolved: (AVRO-493) hadoop mapreduce support for avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting resolved AVRO-493.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.4.0
     Hadoop Flags: [Reviewed]

I just committed this.



[jira] Commented: (AVRO-493) hadoop mapreduce support for avro data

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852177#action_12852177 ] 

Scott Carey commented on AVRO-493:
----------------------------------

bq. Did my answers & patch modifications otherwise satisfy you?

Yes, the patch looks good.

The other items are potential issues that can't be addressed in the patch content itself.



[jira] Updated: (AVRO-493) hadoop mapreduce support for avro data

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-493:
------------------------------

    Attachment: AVRO-493.patch

Here's a patch that implements this.



[jira] Commented: (AVRO-493) hadoop mapreduce support for avro data

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852081#action_12852081 ] 

Scott Carey commented on AVRO-493:
----------------------------------

JIRA's "You can reply to this email to add a comment to the issue online" doesn't appear to work via the Apache mail lists, so I have put the email exchange in the quote below:

{quote}
Scott Carey wrote:
> That's too bad that the intermediate files can't use the Avro file format; performance will suffer until that API changes to either allow custom file formats or to support a feature like the decoder's inputStream() method, allowing buffering of chained or interleaved readers.

The intermediate files are part of the mapreduce kernel.  The buffering, 
sorting, transmission and merging of this data is a critical part of 
mapreduce.  So I don't think it is as simple as just permitting a 
pluggable file format.

> FYI, Avro does not work with Hadoop 0.20 for CDH2 or CDH3 (I have not tried plain 0.20) because they include jackson 1.0.1 and you'll get an exception like this:

Can't one update the version of Jackson in one's Hadoop cluster to fix 
this?  However, that might not work with Amazon's Elastic MapReduce, 
where you don't get to update the cluster (which runs Hadoop 0.18).

Should we avoid using org.codehaus.jackson.JsonFactory.enable() to make 
Avro compatible with older versions of Jackson?

Doug
{quote}

Jackson is in Hadoop due to HADOOP-6184 ("Provide a configuration dump in JSON format").  
In my case, I just removed the jar completely from Hadoop because I don't use that feature.  We could make sure our use of the Jackson API is 1.0.1-compatible, but at some point we will probably require the newer version.  There might be bugs in that version that affect Avro, and it will be troublesome if 1.0.1 is silently used and causes bugs or other issues.

In the short term we could run our unit tests with 1.0.1 and stop using enable() and anything else we use that is not 1.0.1-compatible.
We could even declare a range of supported versions in Maven (for example, [1.0.1,2.0) means 1.0.1 inclusive up to, but excluding, 2.0).

In the long run, Hadoop needs to keep its libraries more up to date given its classloader situation, and/or implement some classloader partitioning to prevent class conflicts between Hadoop system code and user code, especially over small features like HADOOP-6184.


