You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "xiaojian zhou (JIRA)" <ji...@apache.org> on 2019/08/01 17:24:00 UTC
[jira] [Resolved] (GEODE-6973) getExistingIdForType should not compare all entries in idToType region

     [ https://issues.apache.org/jira/browse/GEODE-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xiaojian zhou resolved GEODE-6973.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.10.0

> getExistingIdForType should not compare all entries in idToType region
> ----------------------------------------------------------------------
>
>                 Key: GEODE-6973
>                 URL: https://issues.apache.org/jira/browse/GEODE-6973
>             Project: Geode
>          Issue Type: Bug
>            Reporter: xiaojian zhou
>            Assignee: xiaojian zhou
>            Priority: Major
>              Labels: GeodeCommons
>             Fix For: 1.10.0
>
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> We found the PeerTypeRegistration's getExistingIdForType() will iterate through the idToType region's entries to find if the incoming newType is there. 
> If idToType region contains 20K or 100K entries, this will impact the put throughput (customers did notice the performance downgrade when there're many pdxTypes). 
> To make the things worse, the comparison is to compare the whole object, field to field. If the json object (which will be converted to pdxType) contains 30 fields, the comparison will have to compare up to 30 fields. If the idToType region contains 20K entries, A new pdxType will do 20K  x 30 string comparisons before register it. 
> We found each server maintained a typeToId map, this map is used to check if the pdxType exists. If exists, it will return the type id without check the IdToType region. The total number of pdxType did not impact the put performance if the pdxTypd exists. 
> The typeToId map is maintained with a d-lock, each time we added a new pdxType, it will update into the map while still holding the d-lok. So we believe that the map should be the same as the region in content. If we cannot find the pdxType in the map, it should not be in the region. We can skip the iteration of region (which is the root cause of the performance issue). 
> Another issue in current code is: when each time a new type come, it will recreate the map. This is unnecessary and contributes to the slowness too. 
> We should only create the map during initialize(). 
> Here are the tests we want to introduce:
> 1) a junit test to prove that reorder fields in a big JSON file will not cause significant hashcode conflicts (<1%)
> 2) a junit test to prove that add a index to a field in a big JSON file will hardly cause hashcode conflicts. 
> This 2 tests are to prove that hashcode conflict is not the root cause of linear probing for PDXTypeId. 
> 3) a junit test to prove that for the cases that hashcode conflict caused by reordered fields, there will be no hashcode conflicts if using SORT_JSON_FIELD_NAMES_PROPERTY=true. 
> 4) a dunit test to prove that SORT_JSON_FIELD_NAMES_PROPERTY=true or false did not impact the performance to add a new pdxType. 
> 5) a dunit test to create a new pdxType from 2 peer server at the same time. The test is to prove that the d-lock take effect, one server create the pdxType, and another server should find the pdxType exists. 
> Do this test both from server directly and from clients. 
> 6) Create 2 different objects which ends up with the same hashcode (we can get the 2 objects from test-1), try to put the 2 objects to create new pdxType. The 2nd one should also create a new type. It should not be treated as "found an existing pdxType". 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)