Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/03/07 18:14:32 UTC

[GitHub] [pinot] jackjlli opened a new issue #8306: Upgrade Helix to 1.0+ in Pinot

jackjlli opened a new issue #8306:
URL: https://github.com/apache/pinot/issues/8306


   We’re planning to upgrade the Apache Helix version from 0.9.8 to 1.0+ in the Pinot repo. This will not only address the issues we’ve seen with the current 0.9.8 release (issues which are fixed in Helix 1.0+ and which the Helix project does not plan to back-port to 0.9.x), but also open up opportunities to build more Pinot features on top of the new features in Helix 1.0+. An [initial attempt](https://github.com/apache/pinot/pull/7500) at moving to 1.0 surfaced some test failures in Pinot, so this time we’ve sought help from the Helix team.
   
   ### What we get from Helix 1.0+
   Here are a few items that can be addressed after bumping up the Helix version to 1.0+.
   **ZNRecord serialization issue**
   In the serialize() method of [ZNRecordSerializer](https://github.com/apache/pinot/pull/7500), an expensive Jackson ObjectMapper is constructed on every call; this is already fixed in Helix 1.0+.
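
   To illustrate the cost, here is a minimal sketch (not the actual Helix code) contrasting per-call construction with reusing a single shared ObjectMapper, which is essentially the fix in the newer serializer:

   ```java
   import com.fasterxml.jackson.databind.ObjectMapper;

   public class SerializerCostSketch {
     // Anti-pattern: building an ObjectMapper on every serialize() call is
     // expensive because the mapper rebuilds its internal caches each time.
     public byte[] serializeSlow(Object record) throws Exception {
       ObjectMapper mapper = new ObjectMapper();
       return mapper.writeValueAsBytes(record);
     }

     // Fix: ObjectMapper is thread-safe once configured, so a single shared
     // instance can serve all serialize() calls.
     private static final ObjectMapper SHARED_MAPPER = new ObjectMapper();

     public byte[] serializeFast(Object record) throws Exception {
       return SHARED_MAPPER.writeValueAsBytes(record);
     }
   }
   ```
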
   **Burst ZK write issue during broker startup in large clusters**
   In a large Pinot cluster with thousands of Pinot tables, a broker restart generates a burst of ZK access to the current state, and the Helix controller takes a long time (~20 minutes) to calculate the ideal state. The algorithm for calculating the ideal state is improved in later Helix 1.0+ releases.
   
   **Lack of pagination support**
   Because Helix 0.9.8 lacks pagination support, a huge number of ZNodes must be read from ZK into Pinot during startup, causing a large burst of ZK read and write traffic, especially in big clusters that maintain thousands of Pinot tables. This pain point can be addressed by the [Zk Client API pagination](https://github.com/apache/helix/wiki/Helix-ZkClient-API-to-Support-Getting-a-Large-Number-of-Children-Using-Pagination) feature in Helix 1.0+; the feature is needed to support Pinot clusters with a large number of tables and segments. A hypothetical page-by-page read loop is sketched below.
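   The exact client API is described in the wiki page linked above; purely to illustrate the access pattern, here is a hypothetical cookie-based page loop (ZkStore, getChildrenPage, and Page are made-up names, not the real Helix API):

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class PaginatedChildrenReader {

     /** Hypothetical client interface standing in for the real API. */
     interface ZkStore {
       Page getChildrenPage(String path, int pageSize, long cookie);

       interface Page {
         List<String> names();
         long nextCookie();
         boolean hasMore();
       }
     }

     // Fetch children page by page instead of in one giant getChildren()
     // call, bounding the size of each ZK response.
     public List<String> readAllChildren(ZkStore store, String path, int pageSize) {
       List<String> children = new ArrayList<>();
       long cookie = 0L; // continuation token returned by the previous page
       ZkStore.Page page;
       do {
         page = store.getChildrenPage(path, pageSize, cookie); // illustrative API
         children.addAll(page.names());
         cookie = page.nextCookie();
       } while (page.hasMore());
       return children;
     }
   }
   ```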
   
   **State transition task prioritization**
   Currently, Helix tasks are picked up by the participant based on their in-queue time. However, there are scenarios where tasks queued later need to be picked up first (due to constraints like disk usage). In Pinot, a custom Helix state model called "SegmentOnlineOfflineStateModel" is used for segment-level state transitions: the "OFFLINE->ONLINE" transition downloads a new segment to local disk, and the "OFFLINE->DROPPED" transition deletes a segment from local disk. We have noticed that the "OFFLINE->ONLINE" transitions always run before "OFFLINE->DROPPED", which keeps the Pinot server busy downloading new segments and can fill up the disk before any space is reclaimed. This [issue](https://github.com/apache/helix/issues/1889) will be addressed only in Helix 1.0+.
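   For context, here is a trimmed-down sketch of what such a state model looks like with Helix's participant API (the real SegmentOnlineOfflineStateModel in Pinot carries much more logic):

   ```java
   import org.apache.helix.NotificationContext;
   import org.apache.helix.model.Message;
   import org.apache.helix.participant.statemachine.StateModel;
   import org.apache.helix.participant.statemachine.StateModelInfo;
   import org.apache.helix.participant.statemachine.Transition;

   @StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "ONLINE", "DROPPED"})
   public class SegmentStateModelSketch extends StateModel {

     @Transition(from = "OFFLINE", to = "ONLINE")
     public void onBecomeOnlineFromOffline(Message message, NotificationContext context) {
       // Download the segment named in the message and load it from local disk.
     }

     @Transition(from = "OFFLINE", to = "DROPPED")
     public void onBecomeDroppedFromOffline(Message message, NotificationContext context) {
       // Delete the segment's files from local disk to reclaim space.
     }
   }
   ```

   Because the participant orders these transitions by in-queue time, a backlog of downloads can starve the deletions that would free up disk.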
   
   ### What opportunities Helix 1.0+ provides
   Beyond the fixes above, there are several features in Helix 1.0+ that can serve as building blocks for future Pinot features.
   
   **Leverage the weight-based rebalancer**
   The new [weight based rebalancing](https://github.com/apache/helix/wiki/Weight-aware-Globally-even-distribute-Rebalancer) algorithms can be added to Pinot to support features like weight-based segment assignment and weight-based routing assignment.
   _Weight-based segment assignment_
   
   Right now, Pinot treats all segments as having the same weight, which is not always a good assumption. With this new algorithm, newer Pinot segments could be assigned to hardware with more resources (more RAM, larger SSDs, etc.), since newer segments are typically queried more frequently than older ones, while older segments could be moved to cheaper hardware to reduce cost. A rough configuration sketch follows.
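   As a rough sketch of what this could look like with the weight-aware (WAGED) rebalancer's capacity configuration (the dimension names and numbers here are made up for illustration):

   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.Map;

   import org.apache.helix.model.ClusterConfig;
   import org.apache.helix.model.InstanceConfig;

   public class WeightAwareConfigSketch {
     public static void configure(ClusterConfig clusterConfig, InstanceConfig serverConfig) {
       // Declare the capacity dimensions the rebalancer should balance on.
       clusterConfig.setInstanceCapacityKeys(Arrays.asList("DISK_GB", "RAM_GB"));

       // Advertise this server's capacity along those dimensions, so that
       // heavier segments can be steered to better-provisioned hosts.
       Map<String, Integer> capacity = new HashMap<>();
       capacity.put("DISK_GB", 2000); // e.g. a large-SSD host
       capacity.put("RAM_GB", 256);
       serverConfig.setInstanceCapacityMap(capacity);
     }
   }
   ```
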
   _Weight-based broker routing assignment_
   
   Currently, all brokers with the same tag are treated identically, so queries with completely different query patterns may be routed to the same hosts. With this new algorithm, brokers with different resources could be dedicated to handling different kinds of query patterns.
   
   **Leverage [FederatedZkClient](https://github.com/apache/helix/wiki/FederatedZkClient)**
   FederatedZkClient can maintain multiple ZK connections to different ZK realms, which would give Pinot the option of splitting a large Pinot cluster into multiple ones.
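   Purely as a usage sketch (constructor shape per the wiki page linked above; the routing of sharding keys to ZK realms is resolved by an externally configured metadata store directory service):

   ```java
   import org.apache.helix.zookeeper.api.client.RealmAwareZkClient;
   import org.apache.helix.zookeeper.impl.client.FederatedZkClient;

   public class FederatedZkClientSketch {
     public static RealmAwareZkClient create() throws Exception {
       // No single ZK address is given: the client looks up the right ZK
       // realm for each path it is asked to operate on.
       RealmAwareZkClient.RealmAwareZkConnectionConfig connectionConfig =
           new RealmAwareZkClient.RealmAwareZkConnectionConfig.Builder().build();
       RealmAwareZkClient.RealmAwareZkClientConfig clientConfig =
           new RealmAwareZkClient.RealmAwareZkClientConfig();
       return new FederatedZkClient(connectionConfig, clientConfig);
     }
   }
   ```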
   
   Helix 1.0+ has been in use in production at LinkedIn for years and is considered stable by the Helix team. LinkedIn runs a wide variety of Pinot use cases that cover all the scenarios in which Pinot interacts with Helix.
   
   ### Approach
   We’re going to follow the steps below to make the upgrade go smoothly. A step cannot proceed while any earlier step is blocked.
   
   - Step 1: Create a branch and change the Pinot source code in that branch to build against Helix 1.x
   
   - Step 2: Deploy Pinot with Helix 1.x (from the branch) on LinkedIn staging and production environments and validate (this step may take a few weeks depending on the problems encountered)
   
   - Step 3: Merge the open-source branch back into the master branch
   
   We’ll also update this issue with the status as each of the steps completes.

