You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by "Douglas Kaminsky (JIRA)" <ji...@apache.org> on 2011/07/06 23:39:16 UTC

[jira] [Created] (AVRO-853) Cache hash codes in Schema and Field

Cache hash codes in Schema and Field
------------------------------------

                 Key: AVRO-853
                 URL: https://issues.apache.org/jira/browse/AVRO-853
             Project: Avro
          Issue Type: Improvement
          Components: java
    Affects Versions: 1.5.1
            Reporter: Douglas Kaminsky


We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 

(Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061446#comment-13061446 ] 

Scott Carey commented on AVRO-853:
----------------------------------

{quote}
I think we have two consistent choices:

* Schemas are equal only if all aliases, props, and doc fields match exactly – in other words if toString() prints the same result.
* Schemas are equal based on name, type, and structure alone. In other words, they can exchange serialized data without type promotion.
{quote}

I want to discuss this point a bit more.  These in my mind are the two fundamental use cases for equivalence of schemas:

*  In their JSON form, do they contain the same information?  (requires checking props, aliases, and doc)
or
*  Can these two schemas participate in a bi-directional exchange of serialized information according to the Avro spec? (requires ignoring props, aliases, and doc)

Type promotion in schema resolution cannot be part of equals() because it is asymmmetric.  {"int"} promotesTo {"long"} but not vice-versa.  Like alias translation, this is a separate question and not one of equivalence.  Type promotion is one-way, alias translation is two-way but not transitive.





> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060886#comment-13060886 ] 

Doug Cutting commented on AVRO-853:
-----------------------------------

Another approach to your performance issue might be to use IdentityHashMap.

Note that Schema properties can be added at any time, so all schemas are mutable from that perspective, and the hashcode incorporates the properties.  So addProp() should also invalidate the cache.

Instead of having cache logic in many schema subclasses, perhaps the base implementation can be in terms of a new calculateHashcode() method that each of the subclasses implements instead of hashCode.  Then the cache need only be implemented on the base Schema class.


> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061446#comment-13061446 ] 

Scott Carey edited comment on AVRO-853 at 7/7/11 6:22 PM:
----------------------------------------------------------

{quote}
I think we have two consistent choices:

* Schemas are equal only if all aliases, props, and doc fields match exactly – in other words if toString() prints the same result.
* Schemas are equal based on name, type, and structure alone. In other words, they can exchange serialized data without type promotion.
{quote}

I want to discuss this point a bit more.  These in my mind are the two fundamental use cases for equivalence of schemas:

*  Does a set of equivalent schemas contain the same information in their JSON form?  (requires equivalent props, aliases, and doc)
or
*  Can a set of equivalent schemas participate in mutial exchange of serialized information according to the Avro spec? (requires ignoring props, aliases, and doc)

Type promotion in schema resolution cannot be part of equals() because it is asymmetric.  {"int"} promotesTo {"long"} but not vice-versa.
Alias translation cannot be part of equals() because it is not transitive (but it is pair-symmetric).
These are separate questions, not one of equivalence.

There are more equivalence classes that may be useful too, such as:
*  Can a set of equivalent schemas participate in mutial exchange of serialized information without losing data, ignoring the Avro spec (do not need to match names, just structure and types)






      was (Author: scott_carey):
    {quote}
I think we have two consistent choices:

* Schemas are equal only if all aliases, props, and doc fields match exactly – in other words if toString() prints the same result.
* Schemas are equal based on name, type, and structure alone. In other words, they can exchange serialized data without type promotion.
{quote}

I want to discuss this point a bit more.  These in my mind are the two fundamental use cases for equivalence of schemas:

*  In their JSON form, do they contain the same information?  (requires checking props, aliases, and doc)
or
*  Can these two schemas participate in a bi-directional exchange of serialized information according to the Avro spec? (requires ignoring props, aliases, and doc)

Type promotion in schema resolution cannot be part of equals() because it is asymmmetric.  {"int"} promotesTo {"long"} but not vice-versa.  Like alias translation, this is a separate question and not one of equivalence.  Type promotion is one-way, alias translation is two-way but not transitive.




  
> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063450#comment-13063450 ] 

Doug Cutting commented on AVRO-853:
-----------------------------------

Yes, perhaps another predicate would be useful in addition to 'equals'.  Would the subsumes() method in AVRO-816 suffice?

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061438#comment-13061438 ] 

Scott Carey commented on AVRO-853:
----------------------------------

@Douglas

Good point, we can simplify the hash code functions on complicated members like Props and Aliases.   We can either ignore props, or only use a simple to compute portion of it:  the size.

How should equals with Aliases work?

Are the three below schemas equivalent?  

{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo2"]}
C: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo"]}
{code}

Keep in mind that equals must be transitive, if A == B and B == C implies C == A,  and symmetric C.equals(A) must be true if A.equals(C) is true.
In the above, aliases allow A == B == C.

But this represents a problem for other cases:
{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo2"]}
C2: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}]}
{code}

Aliases allow A == B, and B == C2, but A != C2.  Therefore, we can use aliases in equality only two ways:
# not at all
# exact match only

This means that either 
* Ignore aliases: A == B, B != C, A != C
* Exct match only: A != B, B != C, A != C

I vote for ignoring Aliases in equality checks as we currently do, and having a different version of equals for checking for the ability to transform one schema to another using aliases  "alias promotion".

This is an assymetric process that does not have the transitive property.  A promotesTo B, B promotesTo C2, A !promotesTo C2


I also suspect that we should remove props from equals().  I think those behave similar to aliases.

Are the four schemas below different?  Should they differ across languages or do they represent different data?
{code}
A: {"type":"array", "items":"int"}
B: {"type":"array", "items":"int", "java.typehint":"java.util.List"}
C: {"type":"array", "items":"int", "java.typehint":"intarray"}
{code}

One could argue that these are equal (the serialized form is the same, and the extra properties are only specific to one language implementation).
The props here are just specialized documentation.

I think we have two consistent choices:
* Schemas are equal only if all aliases, props, and doc fields match exactly -- in other words if toString() prints the same result.
* Schemas are equal based on name, type, and structure alone.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064043#comment-13064043 ] 

Doug Cutting commented on AVRO-853:
-----------------------------------

What's the use case for quacksLike()?

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060970#comment-13060970 ] 

Doug Cutting commented on AVRO-853:
-----------------------------------

It's okay to use schemas as Map keys, but it's an error to change a schema after putting it in a Map.  We don't detect that error.  If schema's were immutable that error would be impossible.

I'm not too worried about using zero to indicate 'invalid hash'.  A more random value might be a bit better, but a collision with this value is not fatal, it just makes the cache ineffective for those few schemas whose hashcode is that value.  We could instead have a flag to indicate that the hashcode is unset, or use a boxed Integer, both of which would use more memory without benefit in 2^32-1 cases.  Whatever value we use should probably be a constant.  We can explicitly check when calculateHashCode returns this value and use a different value when it occurs.

In summary, I think adding a hashcode cache to Schema's would be a good thing to do.  I think this patch could be improved a bit, but I'm generally in favor of committing something like this soon.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061025#comment-13061025 ] 

Doug Cutting commented on AVRO-853:
-----------------------------------

> Should a field's Aliases participate in equals() and hashCode()?

I think so.  In most cases that isn't required, but I don't think it does much harm to have them participating and there are probably cases where they are required.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061292#comment-13061292 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

Further, I question whether properties actually need to be part of the hash code. Certainly they factor in to the equality check, but what is the real harm if two perfectly identical schemas with slightly different properties end up in the same hash bucket? How often will this actually happen? I can't imagine this would significantly impact hashing performance.

Equal objects should have equal hashcodes, but equal hashcodes don't imply equal objects

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065029#comment-13065029 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

Sorry, realize my naming there is confusing - assume:

 A==the first code sample
 B==the second code sample

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061535#comment-13061535 ] 

Scott Carey commented on AVRO-853:
----------------------------------

Related thoughts (for possible new JIRA tickets):

* Aliases should be part of Name.  Only named types (record, enum, fixed) and fields can have aliases and much of the alias resolution and helper methods can be encapsulated in Name, along with the hashCode/equals complications.
* "doc" should also go into Name.  Only named types and fields can have "doc".  The spec seems ambiguous with respect to allowing arbitrary properties named "doc" for unnamed types.  
* Name can be simplified if it did not deal with 'anonymous' records (are these used anywhere other than protocols?).  This can be done if RecordSchema is split in two, one for anonymous records and one for named records.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-853:
------------------------------

      Resolution: Fixed
        Assignee: Doug Cutting
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I committed this.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>            Assignee: Doug Cutting
>             Fix For: 1.6.0
>
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch, AVRO-853.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065028#comment-13065028 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

The purpose of a "quacksLike" method is to determine if two schemas are structurally equal - in the particular example where we encountered the original slowdown, we were defining a custom method for serializing schemas, where we performed certain optimizations if we had encountered the schema before. In our example, we don't care if we encounter the exact same schema elsewhere in our protocol, nor if the properties or aliases have been modified. For our purposes, structurally equivalent schemas are the same schema...

Take for example the following schema with corresponding fields (in non-JSON to save typing):

{code}
{
 "name" : "A",
 "type" : "record",
 "fields" : [{"name" : "foo", "type" : "int"},
             {"name" : "bar", "type" : "long"}]
}
{code}

Now let's say that at some point another thread (for the purpose of argument) modifies the properties of this schema:

{code}
{
 "name" : "A",
 "type" : "record",
 "fields" : [{"name" : "foo", "type" : "int"},
             {"name" : "bar", "type" : "long"}]
 "java-type-hint" : "some.type.Here"
}
{code}


A.equals(B) == false
A.quacksLike(B) == true

I almost want to say it's about congruence, but a true congruence predicate would probably ignore naming, too.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "James Baldassari (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061039#comment-13061039 ] 

James Baldassari commented on AVRO-853:
---------------------------------------

I just hammered out a quick patch that uses the second approach I described earlier, in which we traverse the entire schema graph and cache the local state at each node in the graph.  The local cache is invalidated upon mutation.  I ran a test in which I called hashCode() on a record schema 10,000 times and measured the amount of time required to complete all hashCode() calls.  This patch decreased the run time of the test by a little over 27% (from 537ms down to 389ms).  It isn't a slam dunk, but I guess it's an improvement.  I haven't had a chance yet to test that mutating the schema causes the hash code to be updated.  I also haven't added the Field aliases to hashCode() and equals(), but if this looks like a good approach I can work on the remaining tasks.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "James Baldassari (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061013#comment-13061013 ] 

James Baldassari commented on AVRO-853:
---------------------------------------

I was just looking for related issues, and I came across AVRO-261, which talks about immutable Schemas and caching of hash codes.  Is there a separate issue for making Schemas completely immutable?

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061285#comment-13061285 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

FYI, I purposely chose 0 instead of null for thread safety purposes, since primitives have different characteristics than objects in multithreaded context. I'm not certain MIN_INT would be any less likely to be chosen than 0 as a hash in general, but if the Avro classes hash to generally positive numbers, picking any negative number as the "unset" value would be an improvement.

As pointed out earlier, the worst case is that the "unset" value gets calculated over and over again and loses the benefit of cache

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-853:
------------------------------

    Attachment: AVRO-853.patch

Here's a version of the patch that avoids replicating cache logic.  All tests pass.

Douglas, does this help your performance issues?

Does anyone object to this change?

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>             Fix For: 1.6.0
>
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060864#comment-13060864 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

Also, the thread assumes single-threaded behavior - I believe in the worst case hash code could still be calculated multiple times due to multithreading visibility issues, but since it's stored as a primitive there should be no chance of having an "invalid state."

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060952#comment-13060952 ] 

Scott Carey commented on AVRO-853:
----------------------------------

@James 
Yes, I understand what you are talking about.  Partial caching with graph traversal would still be a big win.

Ideally immutability with several schema subtypes being referrentially transparent would be great (all "int" schemas with no props can be the same instance (per classloader) for example, so that the == check triggers more often.

It is not safe to assume the calculated hashCode is 0.  We could use MIN_INT since it is less likely to occur, but must still assume that that may occur on occasion.

Generally speaking with respect to thread-safety:
If hashCode or equals is mutating in _ANY WAY_ it is innapropriate to use the objects as a key in a map or in a set -- hashCode and equals must be stable during the lifetime of use in a map.  So from that perspective, oddities produced by caching the values and mutating the state are user error.  It would be nice to remove user error due to mutability from the equation.   


> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Douglas Kaminsky updated AVRO-853:
----------------------------------

    Attachment: AVRO-853.patch

Simple hashCode caching submitted as a patch - improve upon it if you like. Calling setFields on a RecordSchema forces the cached hash code to reset to 0. If there are other mutable schemas that aren't handled, please build upon this

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061438#comment-13061438 ] 

Scott Carey edited comment on AVRO-853 at 7/7/11 4:57 PM:
----------------------------------------------------------

@Douglas

Good point, we can simplify the hash code functions on complicated members like Props and Aliases.   We can either ignore props, or only use a simple to compute portion of it:  the size.

How should equals with Aliases work?

Are the three below schemas equivalent?  

{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo2"]}
C: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo"]}
{code}

Keep in mind that equals must be transitive, if A == B and B == C implies C == A,  and symmetric C.equals(A) must be true if A.equals(C) is true.
In the above, aliases allow A == B == C.

But this represents a problem for other cases:
{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo2"]}
C2: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}]}
{code}

Aliases allow A == B, and B == C2, but A != C2.  Therefore, we can use aliases in equality only two ways:
# not at all
# exact match only

This means that either 
* Ignore aliases: A == B, B != C, A != C
* Exct match only: A != B, B != C, A != C

I vote for ignoring Aliases in equality checks as we currently do, and having a different version of equals for checking for the ability to transform one schema to another using aliases  "alias promotion".

This is an assymetric process that does not have the transitive property.  A promotesTo B, B promotesTo C2, A !promotesTo C2


I also suspect that we should remove props from equals().  I think those behave similar to aliases.

Are the three schemas below different?  Should they differ across languages or do they represent different data?
{code}
A: {"type":"array", "items":"int"}
B: {"type":"array", "items":"int", "java.typehint":"java.util.List"}
C: {"type":"array", "items":"int", "java.typehint":"intarray"}
{code}

One could argue that these are equal (the serialized form is the same, and the extra properties are only specific to one language implementation).
The props here are just specialized documentation.

I think we have two consistent choices:
* Schemas are equal only if all aliases, props, and doc fields match exactly -- in other words if toString() prints the same result.
* Schemas are equal based on name, type, and structure alone.  In other words, they can exchange serialized data without type promotion. 

      was (Author: scott_carey):
    @Douglas

Good point, we can simplify the hash code functions on complicated members like Props and Aliases.   We can either ignore props, or only use a simple to compute portion of it:  the size.

How should equals with Aliases work?

Are the three below schemas equivalent?  

{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo2"]}
C: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo"]}
{code}

Keep in mind that equals must be transitive, if A == B and B == C implies C == A,  and symmetric C.equals(A) must be true if A.equals(C) is true.
In the above, aliases allow A == B == C.

But this represents a problem for other cases:
{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo2"]}
C2: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}]}
{code}

Aliases allow A == B, and B == C2, but A != C2.  Therefore, we can use aliases in equality only two ways:
# not at all
# exact match only

This means that either 
* Ignore aliases: A == B, B != C, A != C
* Exct match only: A != B, B != C, A != C

I vote for ignoring Aliases in equality checks as we currently do, and having a different version of equals for checking for the ability to transform one schema to another using aliases  "alias promotion".

This is an assymetric process that does not have the transitive property.  A promotesTo B, B promotesTo C2, A !promotesTo C2


I also suspect that we should remove props from equals().  I think those behave similar to aliases.

Are the four schemas below different?  Should they differ across languages or do they represent different data?
{code}
A: {"type":"array", "items":"int"}
B: {"type":"array", "items":"int", "java.typehint":"java.util.List"}
C: {"type":"array", "items":"int", "java.typehint":"intarray"}
{code}

One could argue that these are equal (the serialized form is the same, and the extra properties are only specific to one language implementation).
The props here are just specialized documentation.

I think we have two consistent choices:
* Schemas are equal only if all aliases, props, and doc fields match exactly -- in other words if toString() prints the same result.
* Schemas are equal based on name, type, and structure alone.
  
> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (AVRO-853) Cache hash codes in Schema and Field

Posted by "James Baldassari (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Baldassari updated AVRO-853:
----------------------------------

    Attachment: AVRO-853-approach2.patch

Here's a patch for the graph traversal with local caching approach.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060865#comment-13060865 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

*the thread=the patch


> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061293#comment-13061293 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

Further, I question whether properties actually need to be part of the hash code. Certainly they factor in to the equality check, but what is the real harm if two perfectly identical schemas with slightly different properties end up in the same hash bucket? How often will this actually happen? I can't imagine this would significantly impact hashing performance.

Equal objects should have equal hashcodes, but equal hashcodes don't imply equal objects

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "James Baldassari (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060937#comment-13060937 ] 

James Baldassari commented on AVRO-853:
---------------------------------------

I've recently run into this hash code performance issue as well.  Scott, there may be a way to work around the issue of propagating changes up the reference graph.  We would have to calculate hash codes by computing the hash code for any "local" state, which could be cached, and then adding to that local hash code the hash codes of all child objects, which could also be cached independently.  For example, a RecordSchema has 'aliases', 'doc', 'name', 'props', and 'isError', for its local state, and the hash code for those values could be cached.  Then to calculate the rest of the RecordSchema hash code we invoke hashCode() on all Field instances in the RecordSchema.  Each Field instance knows whether it has been modified or not, and so it can either return a cached hash code or recalculate the hash code for its local state, and so on.  So we would still need to traverse the whole schema graph every time the hash code is requested, but we might achieve some performance gains by caching values at each node in the graph.  Does that make sense?

Also, I took a look at the patch, and I have a couple of comments in addition to Doug's and Scott's:

Is it safe to assume that a calculated hash code will never be 0?  Maybe null would be a safer choice for the default/invalidated value.

Thread-safety is actually an issue here.  Since addProp() and getProp() are synchronized we have to assume that Schema is intended to be used by multiple threads.  The worst-case scenario for the unsynchronized hash code cache is worse than just having 2 threads calculate the hash code at once.  If one thread is modifying the schema while another is calling hashCode(), it could result in a temporary inconsistency between hashCode() and equals().  This could lead to some confusing problems with hash maps/sets.  If the caching is abstracted up into the Schema base class as Doug suggests, it would be fairly simple to synchronize access to the cache.  This could be done with a synchronized method/block or by using something like a ReadWriteLock, which would probably have better performance characteristics for a read-frequently-write-infrequently use case such as this.  In fact, there may already be a synchronization issue with Schema because the properties map is not accessed in a synchronized way in equals() or hashCode()...


> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071268#comment-13071268 ] 

Scott Carey commented on AVRO-853:
----------------------------------

+1, looks good to me. 



> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>             Fix For: 1.6.0
>
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060984#comment-13060984 ] 

Scott Carey commented on AVRO-853:
----------------------------------

Something like this is fine for this 'version' of the Schema.  A possible future immutable Schema implementation would differ.

We should add a big javadoc warning around changing Schemas after they have been used in a map or set to the class and setProp.  

Since it is only valid to call setFields() once on a Record, rather than 'reset' the hash, just throw a runtime exception if hashCode is called and the fields are null.  It is an invalid record if it is used prior to setting the fields -- returning a hashCode would be an error.

We should pick a large random number for the sentinel 'unset' hash value rather than 0.  0 is more common than a random int because several simple hash functions can return 0 -- for example String's hashCode has a bias for 0 on common strings.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-853:
------------------------------

    Fix Version/s: 1.6.0

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>             Fix For: 1.6.0
>
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "James Baldassari (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060965#comment-13060965 ] 

James Baldassari commented on AVRO-853:
---------------------------------------

Immutable Schemas would be great, but since they are mutable today, would you say that it's just not a good idea to use Schema as the key in a hash map?  Maybe a simple workaround for that use case would be to use the Schema's full name - the namespace + schema name - as the key of the map (assuming that is unique of course).

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Douglas Kaminsky updated AVRO-853:
----------------------------------

    Status: Patch Available  (was: Open)

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062220#comment-13062220 ] 

Doug Cutting commented on AVRO-853:
-----------------------------------

As for aliases, two record schemas whose names differ should not be equal via aliases, since they'll read into records of different types.  So differently-named schemas should not be considered equal via aliases.

To record schemas whose names match but with a different set of aliases will be able to read different files and be involved in different protocols; they'll behave differently and thus should not be treated as equal.

Properties are similar: Java reflection uses a property to determine whether to read an array into some kind of a List or into a raw array.  So two schemas that differ only in properties may read the same data into non-equivalent data structures and thus these schemas should not be considered equal.

Also, if something is used by equals() then, if possible, it should be incorporated in to hashCode().  We cannot predict when there might be a large number of items that differ only in this aspect.

I think this patch is basically the right approach for now.  Since mutation after storing in a Map or Set is already an error we cannot detect, perhaps we should not worry about invalidation at all.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060985#comment-13060985 ] 

Scott Carey commented on AVRO-853:
----------------------------------

A related question:

Should a field's Aliases participate in equals() and hashCode() ?


> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060916#comment-13060916 ] 

Scott Carey commented on AVRO-853:
----------------------------------

Both props and aliases are currently mutable.  I'd like the whole thing to be immutable, but that is a fairly significant API change.

The problem with triggering the change of a hash code based on modifying props or setting fields is that it does not cascade up the reference graph.  For example, if you have a record containing 1 field, that is a record and you modify the props of the nested record, it can reset its own hash code but won't reset the outer schema hashCode.  Any time we nest a schema in a schema or field this is a problem.

A truly immutable schema data structure fixes this.

I've been thinking about several ways to change how Schema works to make it truly immutable -- including props and aliases -- other than setting fields to handle recursion.  This morning I was working on an alternate form of Schema because I need a "Schema structured" data structure for another project and require manipulating and composing items associated with schema nodes.  That could be a basis for an immutable Schema data structure.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063686#comment-13063686 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

Depends on your definition of "quacksLike" - if schemaA has "a" and "b" and schemaB has "b" then schemaA could "quack like" schemaB, so subsumes would work. If I wanted to know if schemaA and schemaB are structurally equal (they both have "a" and "b") I would have to check that schemaA subsumes schemaB AND that schemaB subsumes schemaA, which is an inherently inefficient way to compare structure.

P.S. quacksLike is a bad name, but I like it better than "subsumes" - will comment on AVRO-816 similarly :)

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063377#comment-13063377 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

My only concern is losing hash keys - fixing the hash code the first time hashCode is called would be sufficient to ensure that further retrieval of the same exact object would work, but would prevent the use of hashing to "duck-type" a schema.

e.g. Supposing you encountered this first schema:

{code}
{
  "type": "record", 
  "name": "foo",
  "fields" : [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "boolean"}
  ]
}
{code}

You hash the value, then later encounter this schema:

{code}
{
  "type": "record", 
  "name": "foo",
  "fields" : [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "boolean"}
  ],
  "some.property" : "propvalue"
}
{code}

Your use case may or may not depend on these things being considered equal...

Should there be a second equality method that favors content over structure without forcing the end user to compare the schema internals? Using schemaA and schemaB for the above, something like:

schemaA.equals(schemaB); // false
schemaA.quacksLike(schemaB); // true -- yes, stupid name, please pick something better

That nicely solves the equality problem by letting the end user decide whether they care about form or substance.

As to the original issue, I'd just reiterate that although I agree it is ideal for all "equality" components to play a role in hashing, it is not and should not be a requirement, as long as there is no significant performance impact. I feel sufficient documentation (ie. "hey, btw, you will get poor hashing performance if you try to use 50,000 copies of the boolean schema with different properties as hash keys") would be enough to assuage most developers' concerns.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073187#comment-13073187 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

Looks good from my side. 

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>             Fix For: 1.6.0
>
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch, AVRO-853.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated AVRO-853:
------------------------------

    Attachment: AVRO-853.patch

Here's a new version of the patch that uses the cached hash codes to help equals() fail faster.  Is this worthwhile?  It won't help hash tables but would help, e.g. searching a List of schemas for a match.

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>             Fix For: 1.6.0
>
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch, AVRO-853.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Posted by "Douglas Kaminsky (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060895#comment-13060895 ] 

Douglas Kaminsky commented on AVRO-853:
---------------------------------------

I agree I forgot about props, they should also invalidate the cached hash value. You also propose a reasonable optimization.

This is the only method that I have found so far that shows such a dramatic improvement in performance... for the performance test I've been running (writing out 100 records), these are my results:

Status Quo (HashMap, no caching): 4600ms
IdentityMap: 2300ms
HashMap with caching: <100ms

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields and schemas in hash-based data structures (eg. HashMap). Since all fields and schemas are immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes sense to cache the hash code on the object instead of recalculating every time the hashCode method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira