Posted to user@storm.apache.org by Kashyap Mhaisekar <ka...@gmail.com> on 2015/09/29 23:28:39 UTC

Field Group Hash Computation

Hi,
I have a field grouping based on 2 fields. I have 32 consumers for the
tuple and, most of the time, out of 64 bolts the grouped tuples always land
on only 8 of them. Of those 8, 2 receive more than 60% of the data. The
fields used for the grouping can take 20 different combinations of values.

Do you know how the hash of the grouping fields is computed? One of the
mails on the list indicates that the approach is:

It calls "hashCode" on the list of selected values and mods it by the
number of consumer tasks. You can play around with that function to see if
something about your data is causing something degenerative to happen and
cause skew

I looked at the Clojure code, but I am not sure how to read it.

Thanks
Kashyap

Re: Field Group Hash Computation

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

This is interesting. We were mitigating it by avoiding powers of 2. Thanks,
Florian.
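
A possible reason why avoiding powers of 2 helps, sketched under the
assumption that the task index is the hash reduced modulo the number of
tasks (as described earlier in this thread): with 32 tasks only the low 5
bits of the hash decide the index. The hash values below are made up for
illustration, and the class and method names are hypothetical.

```java
// Illustration only: synthetic hash values, not real Storm tuples.
final class PowerOfTwoDemo {
    // Non-negative remainder, so the result can be used as a list index.
    static int taskIndex(int hash, int numTasks) {
        return Math.floorMod(hash, numTasks);
    }

    public static void main(String[] args) {
        // These hashes differ only in bits above the low 5 bits.
        int[] hashes = {5, 37, 69, 101};
        for (int h : hashes) {
            System.out.println(h + " -> mod 32: " + taskIndex(h, 32)
                    + ", mod 31: " + taskIndex(h, 31));
        }
    }
}
```

With 32 tasks all four hashes map to index 5; with 31 tasks they map to 5,
6, 7 and 8. Whether real keys behave this way depends on the data.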

On Tue, Oct 6, 2015 at 5:12 AM, Florian Hussonnois <fh...@gmail.com>
wrote:

> Hi Kashyap,
>
> You could improve the distribution of your tuples by implementing a
> CustomStreamGrouping.
> I tried your example with the Murmur3 algorithm and the result looks
> better.
>
> Arrays.deepHashCode : [-35, -35, -35, -3, -3, -3, -3, 29, 29, 29, 29, 41,
> 51, 61, 61, 61, 61, 61]
> Murmur3 : [-61, -58, -57, -48, -37, -31, -15, -7, -4, 3, 6, 12, 20, 27,
> 45, 49, 56, 57]
>
> You can find my implementation here :
> https://github.com/fhussonnois/storm-cassandra/blob/master/src/main/java/com/github/fhuss/storm/cassandra/Murmur3StreamGrouping.java
>
> Hope this helps.
>
> --
> Florian HUSSONNOIS
>

Re: Field Group Hash Computation

Posted by Florian Hussonnois <fh...@gmail.com>.

Hi Kashyap,

You could improve the distribution of your tuples by implementing a
CustomStreamGrouping.
I tried your example with the Murmur3 algorithm and the result looks
better.

Arrays.deepHashCode : [-35, -35, -35, -3, -3, -3, -3, 29, 29, 29, 29, 41,
51, 61, 61, 61, 61, 61]
Murmur3 : [-61, -58, -57, -48, -37, -31, -15, -7, -4, 3, 6, 12, 20, 27, 45,
49, 56, 57]

You can find my implementation here :
https://github.com/fhussonnois/storm-cassandra/blob/master/src/main/java/com/github/fhuss/storm/cassandra/Murmur3StreamGrouping.java

Hope this helps.
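
The idea of re-mixing the hash before taking the modulus can be sketched
without any Storm dependency. This is not the linked Murmur3StreamGrouping
and not a full Murmur3 hash: it only applies the 32-bit finalizer step of
MurmurHash3 on top of Arrays.deepHashCode. The class and method names are
hypothetical, and the real CustomStreamGrouping interface is not reproduced
here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not the linked Murmur3StreamGrouping.
final class MixedHashGrouping {
    // 32-bit finalizer ("fmix32") from MurmurHash3: spreads changes in any
    // input bit across all output bits.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Picks an index into the list of target tasks for the grouping values.
    static int chooseIndex(List<Object> groupingValues, int numTasks) {
        int raw = Arrays.deepHashCode(groupingValues.toArray());
        return Math.floorMod(fmix32(raw), numTasks);
    }

    public static void main(String[] args) {
        List<Object> key = Arrays.<Object>asList("first field value", "second field value");
        System.out.println("task index is " + chooseIndex(key, 32));
    }
}
```

Note that mixing cannot separate two keys whose deepHashCode is already
identical; it only helps when the raw hashes differ but collapse to the same
index after the modulus.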

2015-10-01 11:08 GMT+02:00 Matthias J. Sax <mj...@apache.org>:

> The hash code will only be computed on the fields specified as grouping
> attributes.
>
> Thus, Values(str2,str3) will be used.
>
> The code is basically Tuple.selectFields(groupingFields).hashValue()
>
> -Matthias


-- 
Florian HUSSONNOIS

Re: Field Group Hash Computation

Posted by "Matthias J. Sax" <mj...@apache.org>.

The hash code will only be computed on the fields specified as grouping
attributes.

Thus, Values(str2,str3) will be used.

The code is basically Tuple.selectFields(groupingFields).hashValue()

-Matthias
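
A minimal sketch of that behaviour, assuming the grouping fields are picked
by position and hashed with Arrays.deepHashCode as in the snippet quoted
elsewhere in this thread. The class, the select helper and the sample values
are hypothetical stand-ins, not Storm's own API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for selecting the grouping fields from a tuple.
final class GroupingFieldsHash {
    // Keeps only the values at the given positions, in that order.
    static List<Object> select(List<Object> tuple, int... positions) {
        List<Object> selected = new ArrayList<Object>();
        for (int p : positions) {
            selected.add(tuple.get(p));
        }
        return selected;
    }

    // Hash of the selected values only; the other fields play no part.
    static int groupingHash(List<Object> tuple, int... positions) {
        return Arrays.deepHashCode(select(tuple, positions).toArray());
    }

    public static void main(String[] args) {
        List<Object> a = Arrays.<Object>asList("x", "s2", "s3");
        List<Object> b = Arrays.<Object>asList("y", "s2", "s3");
        // Grouping on positions 1 and 2 (str2, str3): str1 is ignored.
        System.out.println(groupingHash(a, 1, 2) == groupingHash(b, 1, 2));
    }
}
```

The two tuples differ only in the first value, so the program prints true:
both would be routed to the same task.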

On 09/30/2015 04:05 PM, Kashyap Mhaisekar wrote:
> Thanks Matthias. My question was this:
> If I am emitting str1,str2,str3 but field grouping on str2,str3 only,
> will the hash be on Values(str1,str2,str3) or on Values(str2,str3)
> alone? In my case str1,str2 are changing but I see the values go to the
> same bolt instance. Can we debug what hash is generated?
>
> Thank you!
> 
> Kashyap

Re: Field Group Hash Computation

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

Thanks Matthias. My question was this:
If I am emitting str1,str2,str3 but field grouping on str2,str3 only, then
will the hash be on Values(str1,str2,str3) or on Values(str2,str3) alone?
In my case str1,str2 are changing but I see the values go to the same bolt
instance. Can we debug what hash is generated?

Thank you!

Kashyap
On Sep 30, 2015 5:14 AM, "Matthias J. Sax" <mj...@apache.org> wrote:

> Yes. That's right.
>
> "Values" extends ArrayList and does not override .hashCode().
>
> -Matthias
>
> On 09/30/2015 11:21 AM, Kashyap Mhaisekar wrote:
> > Is this the right computation for the hash: ArrayList(str1,str2...).hashCode(),
> > where str1,str2 etc. are the fields being grouped?
> >
> > Thanks
> > Kashyap
> >
> > On Sep 29, 2015 18:04, "Kashyap Mhaisekar" <kashyap.m@gmail.com> wrote:
> >
> >     Thanks guys. From what I understand, partial key grouping is used
> >     when you know your grouping will create imbalance. In my case, most
> >     of my tuples are grouped to one bolt, which makes it a
> >     bottleneck. Since I emit strings, I guess the hash is
> >     ArrayList(str1,str2...).hashCode(). This hash code comes out the same
> >     for different string combinations...
> >
> >     Thanks
> >     Kashyap
> >
> >     On Sep 29, 2015 17:51, "Matthias J. Sax" <mjsax@apache.org> wrote:
> >
> >         Whether you can use "partial key grouping" depends on your use
> >         case. Think carefully before you apply it...
> >
> >         Maybe you want to read the research paper about it. It clearly
> >         describes when you can use it and when not:
> >
> >         https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> >
> >
> >         -Matthias
> >
> >         On 09/30/2015 12:18 AM, Ken Danniswara wrote:
> >         > Hi,
> >         >
> >         > From what I read, the default fields grouping does not balance
> >         > the load the way shuffle grouping does. For this case, there is a
> >         > discussion about a custom grouping implementation called partial
> >         > key grouping, which balances the load better. Maybe it
> >         > helps: https://github.com/gdfm/partial-key-grouping
> >         >
> >         > On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar
> >         > <kashyap.m@gmail.com> wrote:
> >         >
> >         >     Thanks Derek. I use strings and I still end up with some
> >         >     bolts receiving most of the requests :(
> >         >
> >         >     On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit
> >         >     <derekd@yahoo-inc.com> wrote:
> >         >
> >         >         The code that hashes the field values is here:
> >         >
> >         >         https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
> >         >
> >         >
> >         >         You can write a little Java program, something like:
> >         >
> >         >         import java.util.ArrayList;
> >         >         import java.util.Arrays;
> >         >
> >         >         public class HashCheck {
> >         >           public static void main(String[] args) {
> >         >             ArrayList<String> myList = new ArrayList<String>();
> >         >             myList.add("first field value");
> >         >             myList.add("second field value");
> >         >
> >         >             // as in tuple.clj
> >         >             int hash = Arrays.deepHashCode(myList.toArray());
> >         >             System.out.println("hash is " + hash);
> >         >
> >         >             int numTasks = 32;
> >         >             // floorMod keeps the index non-negative when hash < 0
> >         >             System.out.println("task index is " + Math.floorMod(hash, numTasks));
> >         >           }
> >         >         }
> >         >
> >         >
> >         >         There are certain types of values that may not hash
> >         >         consistently.  If you are using String values, then it
> >         >         should be fine. Other types may or may not, depending on
> >         >         how the class implements hashCode().
> >         >
> >         >
> >         >         --
> >         >         Derek
> >
>
>

Re: Field Group Hash Computation

Posted by "Matthias J. Sax" <mj...@apache.org>.

Yes. That's right.

"Values" extends ArrayList and does not override .hashCode().

-Matthias
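
A minimal sketch for checking this locally, assuming only String field values. The class and method names here are hypothetical and not part of Storm. It compares the plain ArrayList hash with Arrays.deepHashCode over the same elements, which is the call suggested elsewhere in this thread as mirroring tuple.clj.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListHashCheck {

    // Hash of the field values as an ArrayList: starts at 1, then h = 31*h + elementHash.
    static int listHash(String... fields) {
        List<Object> values = new ArrayList<>(Arrays.asList(fields));
        return values.hashCode();
    }

    // Hash of the same field values via Arrays.deepHashCode.
    static int deepHash(String... fields) {
        return Arrays.deepHashCode(fields);
    }

    public static void main(String[] args) {
        int a = listHash("first field value", "second field value");
        int b = deepHash("first field value", "second field value");
        // For non-array elements such as Strings, both calls use the same recurrence.
        System.out.println("ArrayList.hashCode:   " + a);
        System.out.println("Arrays.deepHashCode:  " + b);
        System.out.println("equal: " + (a == b));
    }
}
```

The two only diverge when an element is itself an array, because deepHashCode descends into nested arrays while ArrayList.hashCode calls hashCode() on the array object.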

On 09/30/2015 11:21 AM, Kashyap Mhaisekar wrote:
> Is the computation right for hash? ArrayList(str1,str2...).hashcode()
> where str1,str2 etc are fields being grouped?
> 
> Thanks
> Kashyap
> 

Re: Field Group Hash Computation

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

Is this the right computation for the hash: ArrayList(str1, str2, ...).hashCode(),
where str1, str2, etc. are the fields being grouped?

Thanks
Kashyap

Re: Field Group Hash Computation

Posted by Derek Dagit <de...@yahoo-inc.com>.

OK, it looks like you have a very unlucky number of tasks.

For each of your string values s, taking the Arrays#deepHashCode of a one-element List containing s gives integers that are very poorly distributed modulo 64.


user=> (def l '("0:499", "500:999", "1000:1499", "1500:1999",
"2000:2499", "2500:2999", "3000:3499", "3500:3999",
"4000:4499", "4500:4999", "5000:5499", "5500:5999",
"6000:6499", "6500:6999", "7000:7499", "7500:7999",
"8000:8499", "9500:9999" ))

user=> (def num-tasks 64)

user=> (defn f [^List l] (-> l t/list-hash-code (mod num-tasks)))

user=> (for [x l] (f (list x)))
(41 51 29 61 61 29 29 61 61 29 29 61 61 29 29 61 61 61)

Half of these (9 of 18) land on 61, seven land on 29, and one each lands on 41 and 51.


If the number of tasks is a nearby 65:

user=> (def num-tasks 65)

user=> (sort (for [x l] (f (list x))))
(1 7 8 14 14 20 21 27 28 34 39 47 52 53 54 58 59 61)

Only 14 occurs twice.



It seems your number of tasks is an unlucky modulus. Since 64 is a power of two, the task index depends only on the low six bits of the hash.
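
The comparison can also be reproduced in plain Java. This is a sketch under assumptions: the class and method names are hypothetical, a single grouping field is used, and Math.floorMod stands in for the non-negative Clojure mod used in the REPL session. It counts how many distinct task indices the 18 keys reach for a given number of tasks.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class TaskSpread {

    // The grouping-field values from the thread.
    static final String[] KEYS = {
        "0:499", "500:999", "1000:1499", "1500:1999",
        "2000:2499", "2500:2999", "3000:3499", "3500:3999",
        "4000:4499", "4500:4999", "5000:5499", "5500:5999",
        "6000:6499", "6500:6999", "7000:7499", "7500:7999",
        "8000:8499", "9500:9999"
    };

    // Task index for one grouping value: hash of the one-element list, then a non-negative modulo.
    static int taskIndex(String key, int numTasks) {
        int hash = Arrays.deepHashCode(new Object[] { key });
        return Math.floorMod(hash, numTasks);
    }

    // The set of task indices that receive at least one key.
    static Set<Integer> usedTasks(int numTasks) {
        Set<Integer> used = new TreeSet<>();
        for (String key : KEYS) {
            used.add(taskIndex(key, numTasks));
        }
        return used;
    }

    public static void main(String[] args) {
        for (int n : new int[] { 64, 65 }) {
            Set<Integer> used = usedTasks(n);
            System.out.println(n + " tasks -> " + used.size() + " distinct indices: " + used);
        }
    }
}
```

Running it with a few candidate task counts before deploying is a cheap way to spot a degenerate modulus for a known key set.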

 
-- 
Derek



Re: Field Group Hash Computation

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

Thanks Derek. Here is the code and the results.
When each string is added to an ArrayList and (hashCode % 64) is computed,
many of them come out the same. 64 is the number of consumer tasks. The hash
codes of the strings themselves are all different.

My bolt emits
collector.emit(new Values(str1,str2,str3)) where str3 is the grouping field
and takes the string values in "arr" in the program below.
---------------
package com.demo;

import java.util.ArrayList;

public class HashTest {

    public static void main(String[] args) {

        // Possible values of the grouping field (str3).
        String[] arr = { "0:499", "500:999", "1000:1499", "1500:1999",
                "2000:2499", "2500:2999", "3000:3499", "3500:3999",
                "4000:4499", "4500:4999", "5000:5499", "5500:5999",
                "6000:6499", "6500:6999", "7000:7499", "7500:7999",
                "8000:8499", "9500:9999" };

        int tasks = 64; // number of consumer tasks
        for (int i = 0; i < arr.length; i++) {
            // Wrap the single grouping value in a list before hashing it.
            ArrayList<String> arl = new ArrayList<String>();
            arl.add(arr[i]);

            // Java's % keeps the sign of the hash, so negative values can appear.
            System.out.println("Hash: " + arr[i] + " -- (hash): "
                    + (arl.hashCode() % tasks) + " -- (String's hashcode): " + arr[i].hashCode());
        }
    }
}

Results:
Hash: 0:499 -- (hash): 41 -- (String's hashcode): 46108682
Hash: 500:999 -- (hash): 51 -- (String's hashcode): 1213367572
Hash: 1000:1499 -- (hash): 29 -- (String's hashcode): 464373438
Hash: 1500:1999 -- (hash): 61 -- (String's hashcode): 588495326
Hash: 2000:2499 -- (hash): -3 -- (String's hashcode): -1343051234
Hash: 2500:2999 -- (hash): -35 -- (String's hashcode): -1218929346
Hash: 3000:3499 -- (hash): 29 -- (String's hashcode): 1144491390
Hash: 3500:3999 -- (hash): 61 -- (String's hashcode): 1268613278
Hash: 4000:4499 -- (hash): -3 -- (String's hashcode): -662933282
Hash: 4500:4999 -- (hash): -35 -- (String's hashcode): -538811394
Hash: 5000:5499 -- (hash): 29 -- (String's hashcode): 1824609342
Hash: 5500:5999 -- (hash): 61 -- (String's hashcode): 1948731230
Hash: 6000:6499 -- (hash): 61 -- (String's hashcode): 17184670
Hash: 6500:6999 -- (hash): 29 -- (String's hashcode): 141306558
Hash: 7000:7499 -- (hash): -35 -- (String's hashcode): -1790240002
Hash: 7500:7999 -- (hash): -3 -- (String's hashcode): -1666118114
Hash: 8000:8499 -- (hash): 61 -- (String's hashcode): 697302622
Hash: 9500:9999 -- (hash): -3 -- (String's hashcode): -986000162

----------------------
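
A note on the negative values above: Java's % operator keeps the sign of the left operand, while the Clojure mod used in Derek's REPL session always returns a value in [0, tasks). The sketch below uses a hypothetical class name and assumes Storm applies a non-negative modulo as that REPL session does. Under that assumption, -3 and 61 are the same task, as are -35 and 29, so the 18 keys reach only four tasks.

```java
import java.util.ArrayList;
import java.util.List;

public class ModVsRemainder {

    // Hash of a one-element list holding the grouping value.
    static int listHash(String key) {
        List<String> fields = new ArrayList<>();
        fields.add(key);
        return fields.hashCode();
    }

    public static void main(String[] args) {
        int tasks = 64;
        int hash = listHash("2000:2499");

        // Remainder: sign follows the hash, so the result can be negative.
        System.out.println("hash % tasks          = " + (hash % tasks));
        // Floor modulo: always lands in [0, tasks).
        System.out.println("floorMod(hash, tasks) = " + Math.floorMod(hash, tasks));
    }
}
```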

Thanks
kashyap


Re: Field Group Hash Computation

Posted by Derek Dagit <de...@yahoo-inc.com>.

> This hashcode is coming out same for different string combinations...

As far as I understand, this can only happen with vanishingly small probability.

Here is the hashCode implementation for String: 
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/lang/String.java#String.hashCode%28%29

Here is the Arrays code that combines the hashes of the individual Strings:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/Arrays.java#Arrays.deepHashCode%28java.lang.Object[]%29
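
For readers who cannot open the links, here is a hand-rolled sketch of the String hash recurrence as documented in the String.hashCode() Javadoc. The class and method names are hypothetical. It is meant only to make the arithmetic visible, not to replace the JDK implementation.

```java
public class StringHashByHand {

    // Documented recurrence for String.hashCode(): start at 0, then h = 31*h + character code.
    // Integer overflow wraps silently, exactly as in the JDK.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String key = "0:499";
        System.out.println("manual:   " + manualHash(key));
        System.out.println("built-in: " + key.hashCode());
    }
}
```

Full 32-bit collisions between short, distinct keys are rare with this scheme. Collisions after reducing the hash modulo a small task count are a separate matter and far more likely.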



Would you share an example of different combinations of String field values that hash to the same hashcode value? 
-- 
Derek



Re: Field Group Hash Computation

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

Thanks guys. From what I understand, partial key grouping is used when you
know your grouping will create imbalance. In my case, most of my field
groups go to one bolt, which makes it a bottleneck. Since I emit strings,
I guess the hash is computed as ArrayList(str1, str2, ...).hashCode(). This
hash code is coming out the same for different string combinations...
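
One way to check this offline is to run the key combinations through the same hash and count how many land on each task. The following is only a sketch: GroupingSkewCheck, taskIndex and histogram are hypothetical names (not part of Storm), the keys are placeholders, and it assumes the task index is Arrays.deepHashCode of the grouping values reduced modulo the number of tasks, as described in this thread.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Storm: counts how many keys land on each task.
final class GroupingSkewCheck {

    // Task index for one combination of grouping-field values:
    // Arrays.deepHashCode over the values, reduced modulo the task count.
    static int taskIndex(List<?> fieldValues, int numTasks) {
        return Math.floorMod(Arrays.deepHashCode(fieldValues.toArray()), numTasks);
    }

    // counts[i] = number of key combinations routed to task i.
    static int[] histogram(List<? extends List<?>> keys, int numTasks) {
        int[] counts = new int[numTasks];
        for (List<?> key : keys) {
            counts[taskIndex(key, numTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder keys; substitute the real field combinations here.
        List<List<String>> keys = Arrays.asList(
            Arrays.asList("str1", "a"),
            Arrays.asList("str1", "b"),
            Arrays.asList("str2", "a"),
            Arrays.asList("str2", "b"));
        System.out.println("32 tasks: " + Arrays.toString(histogram(keys, 32)));
        System.out.println("31 tasks: " + Arrays.toString(histogram(keys, 31)));
    }
}
```

Running it with the real 20 combinations and a few different task counts should show whether the skew comes from the keys themselves or from how the hash interacts with the task count.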

Thanks
Kashyap
On Sep 29, 2015 17:51, "Matthias J. Sax" <mj...@apache.org> wrote:

> Whether you can use "partial key grouping" depends on your use case. Think
> carefully before you apply it...
>
> Maybe you want to read the research paper about it. It clearly describes
> when you can use it and when not:
>
> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>
>
> -Matthias
>
> On 09/30/2015 12:18 AM, Ken Danniswara wrote:
> > Hi,
> >
> > From what I read, the default FieldGrouping does not balance the load
> > the way ShuffleGrouping does. For this case, there is a discussion about
> > a custom grouping implementation called partial key grouping, which
> > balances the load better. Maybe it helps.
> > https://github.com/gdfm/partial-key-grouping
> >
> > On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar <kashyap.m@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> >     Thanks Derek. I use strings and I still end up with some bolts
> >     having the maximum requests :(
> >
> >     On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <derekd@yahoo-inc.com
> >     <ma...@yahoo-inc.com>> wrote:
> >
> >         The code that hashes the field values is here:
> >
> >
> https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
> >
> >
> >         You can write a little java program, something like:
> >
> >         import java.util.ArrayList;
> >         import java.util.Arrays;
> >
> >         public class FieldHashDemo {
> >           public static void main(String[] args) {
> >             ArrayList<String> myList = new ArrayList<String>();
> >             myList.add("first field value");
> >             myList.add("second field value");
> >
> >             // Same hash as in tuple.clj
> >             int hash = Arrays.deepHashCode(myList.toArray());
> >
> >             System.out.println("hash is " + hash);
> >             int numTasks = 32;
> >
> >             // floorMod keeps the index non-negative when the hash is negative
> >             System.out.println("task index is " + Math.floorMod(hash, numTasks));
> >           }
> >         }
> >
> >
> >         There are certain types of values that may not hash
> >         consistently.  If you are using String values, then it should be
> >         fine. Other types may or may not, depending on how the class
> >         implements hashCode().
> >
> >
> >         --
> >         Derek
> >
> >
> >         ________________________________
> >         From: Kashyap Mhaisekar <kashyap.m@gmail.com
> >         <ma...@gmail.com>>
> >         To: user@storm.apache.org <ma...@storm.apache.org>
> >         Sent: Tuesday, September 29, 2015 4:28 PM
> >         Subject: Field Group Hash Computation
> >
> >
> >
> >         Hi,
> >         I have a field grouping based on 2 fields. I have 32 consumers
> >         for the tuple and I see most of the times, out of 64 bolts, the
> >         field group is always on 8 of them. Of the 8, 2 have more than
> >         60% of the data. The data for the field grouping can have 20
> >         different combinations.
> >
> >         Do you know what is the way to compute the Hash of the fields
> >         used for computing? One of the groups mails indicate that the
> >         approach is -
> >
> >         It calls "hashCode" on the list of selected values and mods it
> >         by the
> >         number of consumer tasks. You can play around with that function
> >         to see if
> >         something about your data is causing something degenerative to
> >         happen and
> >         cause skew
> >
> >         I saw the clojure code but not sure how to understand this.
> >
> >         Thanks
> >         Kashyap
> >
> >
> >
>
>

Re: Field Group Hash Computation

Posted by "Matthias J. Sax" <mj...@apache.org>.

Whether you can use "partial key grouping" depends on your use case. Think
carefully before you apply it...

Maybe you want to read the research paper about it. It clearly describes
when you can use it and when not:
https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf


-Matthias

On 09/30/2015 12:18 AM, Ken Danniswara wrote:
> Hi,
> 
> From what I read, the default FieldGrouping does not balance the load
> the way ShuffleGrouping does. For this case, there is a discussion about
> a custom grouping implementation called partial key grouping, which
> balances the load better. Maybe it helps.
> https://github.com/gdfm/partial-key-grouping
> 
> On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar <kashyap.m@gmail.com
> <ma...@gmail.com>> wrote:
> 
>     Thanks Derek. I use strings and I still end up with some bolts
>     having the maximum requests :(
> 
>     On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <derekd@yahoo-inc.com
>     <ma...@yahoo-inc.com>> wrote:
> 
>         The code that hashes the field values is here:
> 
>         https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
> 
> 
>         You can write a little java program, something like:
> 
>         import java.util.ArrayList;
>         import java.util.Arrays;
> 
>         public class FieldHashDemo {
>           public static void main(String[] args) {
>             ArrayList<String> myList = new ArrayList<String>();
>             myList.add("first field value");
>             myList.add("second field value");
> 
>             // Same hash as in tuple.clj
>             int hash = Arrays.deepHashCode(myList.toArray());
> 
>             System.out.println("hash is " + hash);
>             int numTasks = 32;
> 
>             // floorMod keeps the index non-negative when the hash is negative
>             System.out.println("task index is " + Math.floorMod(hash, numTasks));
>           }
>         }
> 
> 
>         There are certain types of values that may not hash
>         consistently.  If you are using String values, then it should be
>         fine. Other types may or may not, depending on how the class
>         implements hashCode().
> 
> 
>         --
>         Derek
> 
> 
>         ________________________________
>         From: Kashyap Mhaisekar <kashyap.m@gmail.com
>         <ma...@gmail.com>>
>         To: user@storm.apache.org <ma...@storm.apache.org>
>         Sent: Tuesday, September 29, 2015 4:28 PM
>         Subject: Field Group Hash Computation
> 
> 
> 
>         Hi,
>         I have a field grouping based on 2 fields. I have 32 consumers
>         for the tuple and I see most of the times, out of 64 bolts, the
>         field group is always on 8 of them. Of the 8, 2 have more than
>         60% of the data. The data for the field grouping can have 20
>         different combinations.
> 
>         Do you know what is the way to compute the Hash of the fields
>         used for computing? One of the groups mails indicate that the
>         approach is -
> 
>         It calls "hashCode" on the list of selected values and mods it
>         by the
>         number of consumer tasks. You can play around with that function
>         to see if
>         something about your data is causing something degenerative to
>         happen and
>         cause skew
> 
>         I saw the clojure code but not sure how to understand this.
> 
>         Thanks
>         Kashyap
> 
> 
>

Re: Field Group Hash Computation

Posted by Ken Danniswara <ke...@kth.se>.

Hi,

From what I read, the default FieldGrouping does not balance the load the
way ShuffleGrouping does. For this case, there is a discussion about a
custom grouping implementation called partial key grouping, which balances
the load better. Maybe it helps.
https://github.com/gdfm/partial-key-grouping
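
As I understand it, the idea is that each key gets two candidate tasks instead of one, and every tuple goes to whichever candidate has received less so far. A rough sketch of that idea follows. It is not the code from the repository above: TwoChoiceGrouper and chooseTask are made-up names, and the second hash is just an arbitrary bit-mixing step chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the implementation from the linked repository.
final class TwoChoiceGrouper {
    // Tuples sent to each task so far, as seen by this sender only.
    private final long[] sent;

    TwoChoiceGrouper(int numTasks) {
        this.sent = new long[numTasks];
    }

    // Picks the less loaded of two candidate tasks derived from the key.
    int chooseTask(List<?> key) {
        int h = Arrays.deepHashCode(key.toArray());
        int first = Math.floorMod(h, sent.length);

        // Second candidate: an arbitrary mixing step (an assumption made for
        // this sketch) so that it usually differs from the first candidate.
        int mixed = h * 0x9E3779B9;
        mixed ^= mixed >>> 16;
        int second = Math.floorMod(mixed, sent.length);

        int chosen = sent[first] <= sent[second] ? first : second;
        sent[chosen]++;
        return chosen;
    }

    public static void main(String[] args) {
        TwoChoiceGrouper grouper = new TwoChoiceGrouper(32);
        for (int i = 0; i < 6; i++) {
            System.out.println("task " + grouper.chooseTask(Arrays.asList("str1", "str2")));
        }
    }
}
```

Note that the same key can now end up on two different tasks, so this only fits when the downstream bolt can work with partial results per key that are merged later.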

On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar <ka...@gmail.com>
wrote:

> Thanks Derek. I use strings and I still end up with some bolts having the
> maximum requests :(
>
> On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <de...@yahoo-inc.com> wrote:
>
>> The code that hashes the field values is here:
>>
>>
>> https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
>>
>>
>> You can write a little java program, something like:
>>
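>> import java.util.ArrayList;
>> import java.util.Arrays;
>>
>> public class FieldHashDemo {
>>   public static void main(String[] args) {
>>     ArrayList<String> myList = new ArrayList<String>();
>>     myList.add("first field value");
>>     myList.add("second field value");
>>
>>     // Same hash as in tuple.clj
>>     int hash = Arrays.deepHashCode(myList.toArray());
>>
>>     System.out.println("hash is " + hash);
>>     int numTasks = 32;
>>
>>     // floorMod keeps the index non-negative when the hash is negative
>>     System.out.println("task index is " + Math.floorMod(hash, numTasks));
>>   }
>> }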
>>
>>
>> There are certain types of values that may not hash consistently.  If you
>> are using String values, then it should be fine. Other types may or may
>> not, depending on how the class implements hashCode().
>>
>>
>> --
>> Derek
>>
>>
>> ________________________________
>> From: Kashyap Mhaisekar <ka...@gmail.com>
>> To: user@storm.apache.org
>> Sent: Tuesday, September 29, 2015 4:28 PM
>> Subject: Field Group Hash Computation
>>
>>
>>
>> Hi,
>> I have a field grouping based on 2 fields. I have 32 consumers for the
>> tuple and I see most of the times, out of 64 bolts, the field group is
>> always on 8 of them. Of the 8, 2 have more than 60% of the data. The data
>> for the field grouping can have 20 different combinations.
>>
>> Do you know what is the way to compute the Hash of the fields used for
>> computing? One of the groups mails indicate that the approach is -
>>
>> It calls "hashCode" on the list of selected values and mods it by the
>> number of consumer tasks. You can play around with that function to see if
>> something about your data is causing something degenerative to happen and
>> cause skew
>>
>> I saw the clojure code but not sure how to understand this.
>>
>> Thanks
>> Kashyap
>>
>
>

Re: Field Group Hash Computation

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

Thanks Derek. I use strings and I still end up with some bolts having the
maximum requests :(

On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <de...@yahoo-inc.com> wrote:

> The code that hashes the field values is here:
>
>
> https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
>
>
> You can write a little java program, something like:
>
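> import java.util.ArrayList;
> import java.util.Arrays;
>
> public class FieldHashDemo {
>   public static void main(String[] args) {
>     ArrayList<String> myList = new ArrayList<String>();
>     myList.add("first field value");
>     myList.add("second field value");
>
>     // Same hash as in tuple.clj
>     int hash = Arrays.deepHashCode(myList.toArray());
>
>     System.out.println("hash is " + hash);
>     int numTasks = 32;
>
>     // floorMod keeps the index non-negative when the hash is negative
>     System.out.println("task index is " + Math.floorMod(hash, numTasks));
>   }
> }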
>
>
> There are certain types of values that may not hash consistently.  If you
> are using String values, then it should be fine. Other types may or may
> not, depending on how the class implements hashCode().
>
>
> --
> Derek
>
>
> ________________________________
> From: Kashyap Mhaisekar <ka...@gmail.com>
> To: user@storm.apache.org
> Sent: Tuesday, September 29, 2015 4:28 PM
> Subject: Field Group Hash Computation
>
>
>
> Hi,
> I have a field grouping based on 2 fields. I have 32 consumers for the
> tuple and I see most of the times, out of 64 bolts, the field group is
> always on 8 of them. Of the 8, 2 have more than 60% of the data. The data
> for the field grouping can have 20 different combinations.
>
> Do you know what is the way to compute the Hash of the fields used for
> computing? One of the groups mails indicate that the approach is -
>
> It calls "hashCode" on the list of selected values and mods it by the
> number of consumer tasks. You can play around with that function to see if
> something about your data is causing something degenerative to happen and
> cause skew
>
> I saw the clojure code but not sure how to understand this.
>
> Thanks
> Kashyap
>

Re: Field Group Hash Computation

Posted by Derek Dagit <de...@yahoo-inc.com>.

The code that hashes the field values is here:

https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24


You can write a little java program, something like:

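import java.util.ArrayList;
import java.util.Arrays;

public class FieldHashDemo {
  public static void main(String[] args) {
    ArrayList<String> myList = new ArrayList<String>();
    myList.add("first field value");
    myList.add("second field value");

    // Same hash as in tuple.clj
    int hash = Arrays.deepHashCode(myList.toArray());

    System.out.println("hash is " + hash);
    int numTasks = 32;

    // floorMod keeps the index non-negative when the hash is negative
    System.out.println("task index is " + Math.floorMod(hash, numTasks));
  }
}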


There are certain types of values that may not hash consistently.  If you are using String values, then it should be fine. Other types may or may not, depending on how the class implements hashCode().
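
To illustrate the difference, here is a small sketch with two made-up key classes (RawKey and StableKey are hypothetical, as is the fieldHash helper). A class that does not override hashCode() falls back to Object.hashCode(), which is identity-based, so two instances with the same content will generally hash differently and the value is not stable across JVMs. A class that derives hashCode() from its content hashes the same everywhere.

```java
import java.util.Arrays;

final class HashConsistencyDemo {

    // No hashCode() override: uses the identity-based Object.hashCode().
    static final class RawKey {
        final String id;
        RawKey(String id) { this.id = id; }
    }

    // hashCode() derived from content: equal content gives an equal hash.
    static final class StableKey {
        final String id;
        StableKey(String id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof StableKey && ((StableKey) o).id.equals(id);
        }

        @Override
        public int hashCode() { return id.hashCode(); }
    }

    // Same hash call as in the program above, over arbitrary field values.
    static int fieldHash(Object... values) {
        return Arrays.deepHashCode(values);
    }

    public static void main(String[] args) {
        System.out.println("StableKey hashes match: "
            + (fieldHash(new StableKey("a")) == fieldHash(new StableKey("a"))));
        System.out.println("RawKey hashes match: "
            + (fieldHash(new RawKey("a")) == fieldHash(new RawKey("a"))));
    }
}
```

With a key type like RawKey, tuples that look identical could be routed to different tasks, which defeats the purpose of a fields grouping.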

 
-- 
Derek


________________________________
From: Kashyap Mhaisekar <ka...@gmail.com>
To: user@storm.apache.org 
Sent: Tuesday, September 29, 2015 4:28 PM
Subject: Field Group Hash Computation



Hi,
I have a field grouping based on 2 fields. I have 32 consumers for the tuple and I see most of the times, out of 64 bolts, the field group is always on 8 of them. Of the 8, 2 have more than 60% of the data. The data for the field grouping can have 20 different combinations.

Do you know how the hash of the fields used for grouping is computed? One of the group's mails indicates that the approach is -

It calls "hashCode" on the list of selected values and mods it by the 
number of consumer tasks. You can play around with that function to see if 
something about your data is causing something degenerative to happen and 
cause skew

I saw the clojure code but not sure how to understand this.

Thanks
Kashyap