You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@avro.apache.org by "Hernan Otero (Created) (JIRA)" <ji...@apache.org> on 2011/10/27 23:30:32 UTC

[jira] [Created] (AVRO-946) GenericData.resolveUnion() performance improvement

GenericData.resolveUnion() performance improvement
--------------------------------------------------

                 Key: AVRO-946
                 URL: https://issues.apache.org/jira/browse/AVRO-946
             Project: Avro
          Issue Type: Improvement
          Components: java
    Affects Versions: 1.6.0
            Reporter: Hernan Otero


Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):

{code}
  public int resolveUnion(Schema union, Object datum) {
    int i = 0;
    for (Schema type : union.getTypes()) {
      if (instanceOf(type, datum))
        return i;
      i++;
    }
    throw new UnresolvedUnionException(union, datum);
  }
{code}

it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.

Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:

{code}
public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
{
  unionIndexCacheFactory = factory;
}
{code}

This is what the interface and identity hash map-based implementation would look like:

{code}
  /**
   * A factory interface for creating UnionTypeIndexCache instances.
   */
  public static interface UnionIndexCacheFactory
  {
      UnionIndexCache createUnionIndexCache(List<Schema> types);

      /**
       * Used for caching schema indices within a union.
       */
      public static interface UnionIndexCache
      {
          void setTypeIndex(Schema schema, int index);

          int getTypeIndex(Schema schema);
      }

  }

  private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
  {
      @Override
      public UnionIndexCache createUnionIndexCache(List<Schema> types)
      {
          return new UnionIndexCache()
          {
              private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();

              @Override
              public void setTypeIndex(Schema schema, int index)
              {
                  schemaToIndex.put(schema, index);
              }

              @Override
              public int getTypeIndex(Schema schema)
              {
                  Integer index = schemaToIndex.get(schema);
                  return index == null ? -1 : index;
              }
          };
      }
  }
{code}

I will attach a patch later today or early tomorrow.

Thanks in advance,

Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Scott Carey (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141582#comment-13141582 ] 

Scott Carey commented on AVRO-946:
----------------------------------

There are a few other cases where we have dealt with identity caches of Schemas.  The thread safety issue can't be dealt with by a generic ConcurrentHashMap since we need weak keys and/or values to prevent a resource leak. 

The solution for the resolver was to use a ThreadLocal weak map.  As long as the ThreadLocal is static or otherwise a near-singleton, this works.  If it is per instance on something that is instantiated often, ThreadLocal has performance issues.


                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142216#comment-13142216 ] 

Doug Cutting commented on AVRO-946:
-----------------------------------

Hernan, that sounds like a good plan to me.  Would you like to update the patch or should I?
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch, AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hernan Otero updated AVRO-946:
------------------------------

    Attachment: AVRO-946.patch

Patch with suggested implementation.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140484#comment-13140484 ] 

Hernan Otero commented on AVRO-946:
-----------------------------------

I've implemented your proposed solution using the following data structure:

{code}IdentityHashMap<Schema, WeakHashMap<Schema, Integer>>{code}

The first (identity) map's key being the Union itself, the second's (weak) being the datum's schema.

As a side note, I did try implementing this using MultiKeyMap delegating to a ReferenceMap, but was unable to override equalsKey() on the MultiKeyMap due to package accessibility constraints for AbstractHashedMap.HashEntry (it may well be I misunderstood MultiKeyMap's extensibility features as I don't have much experience with it, please feel free to suggest an alternative data structure).

I implemented the feature by adding 3 protected methods (for creating, caching and looking up the entries within the cache), so subclasses of GenericDatumWriter should be able to provide alternative implementations if required:

{code}
protected void createUnionIndexCacheMap();
protected int getCachedUnionIndex(Schema union, Schema datumSchema);
protected void putCachedUnionIndex(Schema union, Schema datumSchema, int index);
{code}

Here are some stats I collected.  The test uses a single record with 3 fields of the same union type.  This union type has 40 different types to choose from.  "best/worst" describe scenarios from the point of view of today's (sequential) implementation:

* best.  Fields are set with the first type within the union.
* worst.  Fields are set with the last type within the union.

These are the results I obtained:

{code}
-----------------------------------------------------
| Scenario |   Time    |       Rate         | Cache |
-----------------------------------------------------
| best     | 11.1 secs |  180,685 loops/sec | true  |
| best     | 12.1 secs |  164,935 loops/sec | false |
| worst    | 11.2 secs |  178,364 loops/sec | true  |
| worst    | 19.4 secs |  102,912 loops/sec | false |
-----------------------------------------------------
{code}

As expected, using the cache shows little difference between the two scenarios (but manages to be faster even in the best-case scenario!), whereas today's (sequential) implementation is significantly slower in the worst case scenario.

I will submit the patch in a few minutes.

Thanks,

Hernan
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141567#comment-13141567 ] 

Hernan Otero commented on AVRO-946:
-----------------------------------

On further thought, the proposed implementation does have a shortcoming.  In order to leverage the optimization, the GenericDatumWriter needs to be shared.  And the current cache implementation is not thread safe.

One option would be to make the cache thread safe (e.g. use a ConcurrentMap or similar structure), a second option would be to move this all back to UnionSchema, but for the time being (pending the longer term solution of making UnionSchema public and extensible), rely on a HashMap<String, Integer> using the datum's schema's getFullName() as key (to avoid the need to rely on identity).

Any thoughts/suggestions?

Thanks,

Hernan
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138511#comment-13138511 ] 

Doug Cutting commented on AVRO-946:
-----------------------------------

I'd prefer this not be a global setting for all union schemas in the JVM.

A longer-term approach might be to make UnionSchema a public extensible class, and provide a visitor/copier API for schemas so that one can easily create a version of a schema replacing the implementations of some elements, like unions.

A good near-term approach might be to add this functionality to GenericDatumWriter.  A MultiKeyMap should provide good performance (http://s.apache.org/c1J).  Note that one can override the hash function used by MultiKeyMap to make it identity or even a perfect hash.  (For a given GenericDatumWriter the schema is fixed so all unions contained in it can be enumerated.)

To be more concrete, instead of calling GenericData.resolveUnion(), GenericDatumWriter() could have its own version of ResolveUnion the uses a MultiKeyMap cache indexed by the union Schema and the value's type.  On misses, the cache can be populated by calling GenericData.resolveUnion().  The hash function can be overidden to be identity for both keys.

Might something like this work for you?  I think it should perform similarly to directly storing the cache in the union Schema.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Doug Cutting (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-946:
------------------------------

       Resolution: Fixed
    Fix Version/s: 1.6.1
         Assignee: Doug Cutting
           Status: Resolved  (was: Patch Available)

I committed this.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>            Assignee: Doug Cutting
>             Fix For: 1.6.1
>
>         Attachments: AVRO-946.patch, AVRO-946.patch, AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hernan Otero updated AVRO-946:
------------------------------

    Status: Patch Available  (was: Open)

Patch submitted.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Doug Cutting (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-946:
------------------------------

    Attachment: AVRO-946.patch

Here's a version of the patch that also adds a resolveUnion() method to GenericDatumReader that delegates to the GenericData method of the same name to permit optimized implementations.

If there are no objections I'll commit this.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch, AVRO-946.patch, AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140133#comment-13140133 ] 

Hernan Otero commented on AVRO-946:
-----------------------------------

Thanks for the feed back.

Yes, I think your proposed solution would also solve the performance issue.

The only question I'd have is regarding adding the Union as one of the keys in the MultiKeyMap (in GenericDatumWriter).  Unless we force identity equality (something we can't rely on within our project), there's nothing else we could use within the Union as an alternative that would still make it unique within the entire schema, is there? (this is only a problem for the Union, for the other key, the datum's schema, we could use getFullName() as an alternative).

Or am I missing something?

Thanks!

Hernan
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142126#comment-13142126 ] 

Hernan Otero commented on AVRO-946:
-----------------------------------

Yes, this implementation gives us a big boost in performance (a little less than using the identity map, but still a significant improvement over no map):

{code}
-----------------------------------------------------
| Scenario |   Time    |       Rate         | Cache |
-----------------------------------------------------
| best     | 11.9 secs |  168,449 loops/sec | true  |
| worst    | 12.2 secs |  164,204 loops/sec | true  |
-----------------------------------------------------
{code}

                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch, AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142134#comment-13142134 ] 

Hernan Otero commented on AVRO-946:
-----------------------------------

I would actually suggest a combined solution.  This representation-independent index in UnionSchema, combined with creating a protected resolveUnion() method in GenericDatumWriter that just delegates to GenericData#resolveUnion()).  That way, we still have a way to optimize further if needed by subclassing GenericDatumWriter (combined with the proposed ThreadLocals for type safety).  At least until the longer term extensibility of Schema subtypes comes along.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch, AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Doug Cutting (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-946:
------------------------------

    Attachment: AVRO-946.patch

It would certainly be nice to cache things directly in the union schema.  I don't think we ought to have representation-specific stuff in Schema, but perhaps we can add a representation-independent random-access table to each union schema.

The value of Schema.getFullName() for each branch in a union is unique within that union.  A Map<String,Schema> in each union might be useful.  Here's a patch that adds such a thing and uses it to implement resolveUnion().  Might this work?
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch, AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142227#comment-13142227 ] 

Hernan Otero commented on AVRO-946:
-----------------------------------

Please go ahead if you don't mind.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch, AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140260#comment-13140260 ] 

Doug Cutting commented on AVRO-946:
-----------------------------------

Identity equality may result in multiple entries for a given schema but the cache should still work correctly.  It would perform poorly if every instance had a different schema, but that's not likely.

Also note that Schema now caches hash codes.  So even using equals hashing would usually only result in a single call to equals, to verify the hash entry.  Equals is fast for identical objects, so, if you used equals hashing, the slow case would be when the cached key is equal but not identical.

I think identity hashing with weak keys is probably preferable.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Posted by "Hernan Otero (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142135#comment-13142135 ] 

Hernan Otero commented on AVRO-946:
-----------------------------------

Oops: type safety -> thread safety
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch, AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance).  The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases).  In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases).  This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira