Posted to dev@avro.apache.org by "Scott Carey (JIRA)" <ji...@apache.org> on 2010/03/31 04:31:27 UTC

[jira] Commented: (AVRO-493) hadoop mapreduce support for avro data

    [ https://issues.apache.org/jira/browse/AVRO-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851707#action_12851707 ] 

Scott Carey commented on AVRO-493:
----------------------------------

AvroWrapper.java:
{code}  public boolean equals(Object o) {
    return (datum == null) ? o == null : datum.equals(o);
  }{code}

The above looks odd.  Is equals() expected to compare the AvroWrapper to another AvroWrapper, or only to the datum?  If the former, then o needs to be cast to AvroWrapper and o.datum used.  If the latter, then equals() does not adhere to the equals() contract and should be well documented.  (a.equals(b) does not always return the same result as b.equals(a).)

Eclipse's auto-generate equals() tool generates this code for equals():
{code}
  public boolean equals(Object obj) {
    if (this == obj)
      return true;
    if (obj == null)
      return false;
    if (getClass() != obj.getClass())
      return false;
    AvroWrapper other = (AvroWrapper) obj;
    if (datum == null) {
      if (other.datum != null)
        return false;
    } else if (!datum.equals(other.datum))
      return false;
    return true;
  }
{code}
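
Also, whichever way equals() ends up working, hashCode() should stay consistent with it -- for example, something along these lines if equals() compares the wrapped datum:
{code}
  @Override
  public int hashCode() {
    // consistent with an equals() that compares the wrapped datum
    return (datum == null) ? 0 : datum.hashCode();
  }
{code}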

AvroOutputFormat:
It would be nice if the compression level were configurable.  If not, I lean towards level 1, since it is closest to lzo in speed and still compresses reasonably well.  But I suppose intermediate outputs and final outputs have different requirements here.
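
Just a sketch of what I have in mind (the "avro.output.deflate.level" key is made up, and I'm assuming CodecFactory.deflateCodec and a reflect datum writer are usable here -- the patch may wire this differently):
{code}
    // hypothetical config key; default to deflate level 1 if unset
    int level = job.getInt("avro.output.deflate.level", 1);
    DataFileWriter<T> writer = new DataFileWriter<T>(new ReflectDatumWriter<T>());
    writer.setCodec(CodecFactory.deflateCodec(level));
{code}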

AvroMapper:
This creates a new AvroWrapper for each output.collect().  Is this necessary?  AvroReducer does not do this, but it doesn't have the combiner behind it.  If it must be a new instance, a comment might be appropriate to avoid an improper optimization attempt later.
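
If reuse does turn out to be safe, something like the following would avoid the per-record allocation (sketch only -- it assumes AvroWrapper grows a datum(T) setter, which I haven't checked against the patch, and it is only valid if the collector/combiner never holds on to the reference):
{code}
  // one wrapper per task instead of one per record
  private final AvroWrapper<OUT> outWrapper = new AvroWrapper<OUT>(null);

  private void collect(OutputCollector<AvroWrapper<OUT>, NullWritable> out, OUT datum)
      throws IOException {
    outWrapper.datum(datum);               // assumes a datum(T) setter exists
    out.collect(outWrapper, NullWritable.get());
  }
{code}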

AvroKeySerialization:
I am a bit confused about this class.  It is only referenced in the configuration of AvroJob.
When is this used in the big picture?  It appears that the Serialization interface contract requires implementations not to buffer (a performance issue for any serialization that is heavily used).  The decoder here doesn't have access to the Avro data file and can't use the file format's reader or writer -- so what is it for?  The javadoc on the Hadoop Serialization API says nothing about the purpose of these interfaces.  Is it meant for serializing data to/from SequenceFiles and other such use cases?
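
For context, my rough understanding of how the framework drives a Serialization (intermediate map-output spills, SequenceFiles, etc.) -- a sketch with placeholder conf/stream/wrapper variables, not code from the patch:
{code}
    // Hadoop resolves a Serialization through SerializationFactory, then streams
    // individual records through it -- no Avro file container is involved.
    SerializationFactory factory = new SerializationFactory(conf);
    Serializer<AvroWrapper> serializer = factory.getSerializer(AvroWrapper.class);
    serializer.open(rawStream);     // stream supplied by the framework (e.g. a spill file)
    serializer.serialize(wrapper);  // called once per record
    serializer.close();
{code}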

General:
Deprecated APIs are used -- are the replacements in the new API not appropriate, or are they still insufficient?  I'm no expert on the details of the new API.

> hadoop mapreduce support for avro data
> --------------------------------------
>
>                 Key: AVRO-493
>                 URL: https://issues.apache.org/jira/browse/AVRO-493
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-493.patch
>
>
> Avro should provide support for using Hadoop MapReduce over Avro data files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.