Posted to common-issues@hadoop.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2010/04/28 04:40:32 UTC

[jira] Created: (HADOOP-6729) serializer.JavaSerialization should be added to io.serializations by default

serializer.JavaSerialization should be added to io.serializations by default
----------------------------------------------------------------------------

                 Key: HADOOP-6729
                 URL: https://issues.apache.org/jira/browse/HADOOP-6729
             Project: Hadoop Common
          Issue Type: Improvement
          Components: conf
    Affects Versions: 0.20.2
            Reporter: Ted Yu


org.apache.hadoop.io.serializer.JavaSerialization isn't included in io.serializations by default.

When a class that implements the Serializable interface is used and serializer.JavaSerialization is not configured, the user sees the following exception:

java.lang.NullPointerException
   at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:759)
   at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:487)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:575)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-6729) serializer.JavaSerialization should be added to io.serializations by default

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861657#action_12861657 ] 

Tom White commented on HADOOP-6729:
-----------------------------------

JavaSerialization was written as an experimental serialization, to prove the abstraction. I'm not sure it should be enabled by default: it's not efficient, it hasn't been tested at scale (as far as I know), and we should encourage users to use other serializations such as Writables or Avro.

In any case, we could improve the error message if the type is Serializable, or print a warning.
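
A minimal sketch of what a friendlier failure could look like, assuming the NullPointerException arises because the lookup finds no configured serialization that accepts the class. The SerializerLookupSketch class and its nested Serialization interface are hypothetical, simplified stand-ins written for illustration; they are not Hadoop's actual SerializationFactory code.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a serialization lookup; not Hadoop's real classes.
final class SerializerLookupSketch {

    /** Minimal stand-in for a serialization plug-in: it only reports which classes it accepts. */
    interface Serialization {
        boolean accept(Class<?> c);
    }

    private final List<Serialization> registered = new ArrayList<>();

    void register(Serialization s) {
        registered.add(s);
    }

    /** Returns the first registered serialization that accepts c, or fails with a descriptive message. */
    Serialization find(Class<?> c) {
        for (Serialization s : registered) {
            if (s.accept(c)) {
                return s;
            }
        }
        StringBuilder msg = new StringBuilder("No serialization in io.serializations accepts ")
                .append(c.getName());
        // Extra hint only when the type could be handled by Java serialization.
        if (Serializable.class.isAssignableFrom(c)) {
            msg.append("; the class implements java.io.Serializable, so adding ")
               .append("org.apache.hadoop.io.serializer.JavaSerialization to io.serializations would handle it");
        }
        throw new IllegalArgumentException(msg.toString());
    }

    public static void main(String[] args) {
        SerializerLookupSketch lookup = new SerializerLookupSketch();
        // Register a plug-in that accepts nothing, mimicking a configuration without JavaSerialization.
        lookup.register(c -> false);
        try {
            lookup.find(String.class);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is only that the failure names the offending class and, for Serializable types, the configuration key to change, instead of surfacing as a bare NullPointerException.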




[jira] Commented: (HADOOP-6729) serializer.JavaSerialization should be added to io.serializations by default

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861847#action_12861847 ] 

Tom White commented on HADOOP-6729:
-----------------------------------

One inefficiency of JavaSerialization is that it stores the class name with every record. This is actually worse than normal Java serialization, which uses back references to class names to make the resulting stream more compact. This optimization is disabled in Hadoop (see JavaSerializationSerializer#serialize()) because records are reordered in the shuffle, which would break the back references.
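
The size cost can be illustrated with the JDK alone. The sketch below assumes the back references are dropped by calling ObjectOutputStream.reset() before each record; BackReferenceDemo and its Point record type are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class BackReferenceDemo {

    /** Hypothetical record type used only for this illustration. */
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Serializes count records into one stream and returns the stream size in bytes. */
    static int serializedSize(int count, boolean resetBeforeEachRecord) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < count; i++) {
                if (resetBeforeEachRecord) {
                    // Forget previously written class descriptors, so the next
                    // record carries its full class description again.
                    out.reset();
                }
                out.writeObject(new Point(i, i));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("with back references:     " + serializedSize(1000, false) + " bytes");
        System.out.println("reset before each record: " + serializedSize(1000, true) + " bytes");
    }
}
```

With back references, the class descriptor is written once and later records point back to it; with a reset before each record, every record is self-describing and can be read in any order, at the price of a larger stream.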

Another inefficiency is that JavaSerialization creates a new object every time deserialize() is called. In the context of large-scale data processing, where there may be billions of records, this is very expensive, which is why Writables and Avro reuse instances.
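
The allocation difference can be sketched with the JDK alone. InstanceReuseDemo, its Counter type, and the readFields method below are hypothetical illustrations of the reuse pattern, not Hadoop's Writable interface itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class InstanceReuseDemo {

    /** Hypothetical record; readFields overwrites the fields of an existing instance. */
    static final class Counter implements Serializable {
        private static final long serialVersionUID = 1L;
        int value;

        void readFields(DataInputStream in) throws IOException {
            value = in.readInt();
        }
    }

    /** Counts the distinct objects produced when each record is read with readObject(). */
    static int instancesWithJavaSerialization(int records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                for (int i = 0; i < records; i++) {
                    Counter c = new Counter();
                    c.value = i;
                    out.writeObject(c);
                }
            }
            Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                for (int i = 0; i < records; i++) {
                    seen.add(in.readObject()); // allocates a fresh Counter per record
                }
            }
            return seen.size();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the distinct objects touched when one instance is refilled for every record. */
    static int instancesWithReuse(int records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                for (int i = 0; i < records; i++) {
                    out.writeInt(i);
                }
            }
            Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());
            Counter reused = new Counter();
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                for (int i = 0; i < records; i++) {
                    reused.readFields(in); // same object, new field values
                    seen.add(reused);
                }
            }
            return seen.size();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Java serialization instances: " + instancesWithJavaSerialization(1000));
        System.out.println("Reused instances:             " + instancesWithReuse(1000));
    }
}
```

The first method ends up with one object per record, while the second only ever touches a single instance, which is the property that matters when the record count runs into the billions.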

