You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@avro.apache.org by GitBox <gi...@apache.org> on 2022/06/29 09:05:28 UTC

[GitHub] [avro] clesaec opened a new pull request, #1742: Avro 3527 optim hashcode

clesaec opened a new pull request, #1742:
URL: https://github.com/apache/avro/pull/1742

   [AVRO-3527](https://issues.apache.org/jira/browse/AVRO-3527) : Small optimization on hashcode to avoid whole record exploration.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] KalleOlaviNiemitalo commented on pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
KalleOlaviNiemitalo commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1312634759

   The difficulty is in recursive calls
   
   <https://github.com/apache/avro/blob/3f55816d2baec1ec6bd1ed4be69fd6977a8316d5/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L1116>
   <https://github.com/apache/avro/blob/3f55816d2baec1ec6bd1ed4be69fd6977a8316d5/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L1127>
   
   The limit is set to 10 hashes.  Suppose you have a record type R1 with two properties, each of which has record type R2, which in turn has seven string properties.  Now when you call hashCode on the R1 instance, it should hash all 7 strings of the first R2 instance but only the first 3 strings of the second R2 instance.  So the recursive hashCode call on the first R2 instance must somehow let the caller know how many values it hashed.  Passing an AtomicInteger as a parameter achieves that because hashCode can decrement the value of the AtomicInteger and this change is visible to the caller.  If you passed an int counter parameter instead, then you'd have to return the counter change to the caller in some other way; perhaps by returning an instance of a class rather than just the hash code as int.
   
   That said -- it might be advantageous to replace the AtomicInteger with a class that contains both the counter and the hash code being computed.  The hashCodeAdd method effectively multiplies hash codes by powers of 31:
   
   <https://github.com/apache/avro/blob/3f55816d2baec1ec6bd1ed4be69fd6977a8316d5/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L1143-L1146>
   
   In the scenario with record types R1 and R2, the hash codes of the strings in the first R2 instance get multiplied with 31 raised to the powers of 6, 5, 4, 3, 2, and 1.  The hash codes of the first three strings in the second R2 instance get multiplied with 31 raised to the powers of 2, 1, and 0.  So two of these multipliers get reused.  If there instead were a static class HashCodeCollector that had one int field for the hash code and another int field for the counter, and all the recursive calls fed the values to that, then it would be able to give a separate multiplier to the hash code of each value.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] clesaec commented on pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
clesaec commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1313313940

   So, in the example
   `31*(31*(31*(31*(31*(31*hash(f1) + hash(f2)) + hash(f3)) + hash(f4)) + hash(f'1)) + hash(f'2)) + hash(f'3)`
   ?
   Indeed, that sounds logic.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] dkulp merged pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
dkulp merged PR #1742:
URL: https://github.com/apache/avro/pull/1742


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] clesaec commented on pull request #1742: Avro 3527 optim hashcode

Posted by GitBox <gi...@apache.org>.
clesaec commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1169741018

   > I see, you use AtomicInteger just as a mutable boxed integer, and don't actually care about its atomicity with regard to other threads.
   
   Yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] clesaec commented on pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
clesaec commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1313283142

   Indeed, AtomicInteger, atomicity is may be too much, and an array of size one (int[]) may be to blurry, i will create a specific private class for that.
   For optimization with calculating once each power of 31; it means transform
   ```
   31*(31*(31*(31*hash(f1) + hash(f2)) + hash(f3)) + hash(f4))
   31*(31*hash(f'1) + hash(f'2)) + hash(f'3)
   ```
   into 
   `31*(31*(31*(31*hash(f1) + hash(f2)) + hash(f3) + hash(f'1)) + hash(f4) + hash(f'2)) + hash(f'3)`
   But, as one field, as f2, can also be itself an Array of record containing itself fields ... , it will lead to too complex code for saving at most 10 multiplications.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] nandorKollar commented on pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
nandorKollar commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1312559615

   > Integer is not mutable so you'd need an instance of another class (perhaps an one-element array) to hold the Integer reference.
   > 
   > If the atomicity is too much, then perhaps it's clearest to just define a non-atomic mutable state class with an int field here.
   
   I see, thanks for the explanation. Looks like the point here is on the mutable nature of AtomicInteger, and not to support concurrency, I sounds a bit misleading to me. How about using simply the 'int' primitive type for the counter?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] KalleOlaviNiemitalo commented on pull request #1742: Avro 3527 optim hashcode

Posted by GitBox <gi...@apache.org>.
KalleOlaviNiemitalo commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1169730627

   I see, you use AtomicInteger just as a mutable boxed integer, and don't actually care about its atomicity with regard to other threads.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] KalleOlaviNiemitalo commented on pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
KalleOlaviNiemitalo commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1312547175

   Integer is not mutable so you'd need an instance of another class (perhaps an one-element array) to hold the Integer reference.
   
   If the atomicity is too much, then perhaps it's clearest to just define a non-atomic mutable state class with an int field here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] KalleOlaviNiemitalo commented on pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
KalleOlaviNiemitalo commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1313301234

   I meant using a separate power of 31 for each hashed value could make the hash codes more unique, without increasing the number of multiplications.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] nandorKollar commented on pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
nandorKollar commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1312545555

   What's the advantage of using an AtomicInteger instead of a simple Integer in this context? The counter isn't part of the shared mutable state of GenericData, could you please mention a scenario why AtomicInteger is better than Integer in this case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [avro] clesaec commented on pull request #1742: AVRO-3527: Optimize GenericData.hashCode

Posted by GitBox <gi...@apache.org>.
clesaec commented on PR #1742:
URL: https://github.com/apache/avro/pull/1742#issuecomment-1313628521

   So, i put hashcode calculation logic in an internal class, so counter became a simple int and 31* is apply to each step and code is easier to read.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@avro.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org