Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2020/11/03 02:00:24 UTC

Apache Pinot Daily Email Digest (2020-11-02)

### _#general_

  
 **@ehwalee:** @ehwalee has joined the channel  
 **@noahprince8:** How does fault tolerance work with servers in Pinot? I.e.,
what happens when a server crashes? My guess would be that, as a helix
participant, somehow the controller sees it has crashed, and as such sends out
messages to other servers to take over the segments from the crashed server?
Then, when the server reboots, the controller sees a new server is available,
and starts distributing segments to it? Requiring a rebalance to truly get a
bunch of segments back onto it?  
**@tanmay.movva:** In my understanding, yes, that is how it happens. All helix
participants and controllers maintain their state in zookeeper. As per the
replication config we’ve set, the controller always works to maintain the ideal
state of the cluster and its segments. When a new server joins the helix
cluster, the controller triggers a rebalance, redistributing segments across
servers to maintain a balanced ideal state. It asks the new server to load
segments from the deep store by putting messages into the server’s queue. This
continues until the IDEAL state of the cluster is reached. Hope this helps.
Correct me if I don’t make sense.  
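For reference, the segment-to-server mapping discussed here lives in Helix's ideal state in ZooKeeper and can be inspected through the Helix admin API. A minimal sketch, assuming a reachable ZooKeeper at `localhost:2181`, a cluster named `PinotCluster`, and a table whose Helix resource is `myTable_OFFLINE` (all placeholder values):

```java
import java.util.Map;

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class IdealStateInspector {
  public static void main(String[] args) {
    // Placeholder ZooKeeper address, cluster name, and table resource name.
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    IdealState idealState = admin.getResourceIdealState("PinotCluster", "myTable_OFFLINE");

    // Each Helix "partition" of the table resource is a Pinot segment; the map
    // is instance -> desired state (e.g. ONLINE), which the controller works
    // to converge the live cluster towards.
    for (String segment : idealState.getPartitionSet()) {
      Map<String, String> instanceStateMap = idealState.getInstanceStateMap(segment);
      System.out.println(segment + " -> " + instanceStateMap);
    }
  }
}
```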
**@noahprince8:** It does. So another question, I’m digging into the internals
and it doesn’t look like there’s a message triggering the removal of a
segment, other than from the transition from ONLINE to OFFLINE  
**@noahprince8:** I started to trace down how rebalances work, but it seems
pretty complex. When a rebalance occurs, how are segments removed? What is
that event, and how does it hook in? I would expect it to call
`HelixInstanceDataManager.removeSegment`  
**@noahprince8:** For context, I’m working on  
**@tanmay.movva:** From what I understand, there is also a state
assigned/attached to individual segments, which has info such as which servers
it should be present on, etc. So when the controller triggers a rebalance, I
expect it updates the state of the segments, which in turn changes the ideal
state of the cluster, and to achieve that ideal state it asks servers to
load/unload segments onto their disks/volumes. I haven’t delved into the code
myself, so I wouldn’t be able to point to the exact code blocks where this
happens, but would love to know. :slightly_smiling_face:  
**@g.kishore:** @noahprince8 read a bit about Helix here  
**@g.kishore:** once you get that following Pinot code will be easier.  
**@g.kishore:** all callbacks from Helix are already handled in Pinot, when
you trigger rebalance - new segments are loaded via offlineToOnline callbacks  
**@g.kishore:** see SegmentOnlineOfflineStateModel for more details  
**@g.kishore:**
```
// Delete segment from local directory.
@Transition(from = "OFFLINE", to = "DROPPED")
public void onBecomeDroppedFromOffline(Message message, NotificationContext context) {
```
**@g.kishore:** this gets triggered when a segment is removed from the
idealstate for a specific server  
**@noahprince8:** So, it looks like that directly removes the directory but
never removes it from the table data manager  
 **@karinwolok1:** :wave: Hi everyoneeeee!!! I'm new to the Pinot community
and I want to shape some of the Pinot community programs for the future.
:wine_glass: I'm looking to get to know some people in the community (newbies
and/or seasoned). Anyone here open to having a 30 min video conversation with
me? Pleaaaassssseeeee!!!! :star-struck: (I only look creepy in the pic of me
hiding in corn fields... I'm not _that_ bad). Send a DM or just schedule a
time to chat:  
**@narayanan.arunachalam:** @narayanan.arunachalam has joined the channel  
 **@kennybastani:** Welcome @karinwolok1! So excited to have you  
**@karinwolok1:** Thank youuuuu, Kenny! :heart:  
 **@gregsimons84:** Hey @karinwolok1 Welcome to the wonderful world of Apache
Pinot. Great to have you onboard !  
**@karinwolok1:** Thanks, Greg!  
 **@kennybastani:** Hey all. I highly recommend spending some time with
@karinwolok1 to talk about your use case and experience with Pinot. This is
extremely crucial to the project and for our future. I rarely use `@here`
notifications to this channel, but in this case, it’s important. Thanks
everyone.  
**@karinwolok1:** Thanks, Kenny! Maybe I am downplaying all the awesome things
we can do with the Pinot community and that this is an opportunity to really
be part of the movement . :D  
**@kennybastani:** I think you’re spot on.  
 **@dzpanda:** @dzpanda has joined the channel  

###  _#random_

  
 **@ehwalee:** @ehwalee has joined the channel  
 **@narayanan.arunachalam:** @narayanan.arunachalam has joined the channel  
 **@dzpanda:** @dzpanda has joined the channel  

###  _#feat-text-search_

  
 **@gary.a.stafford:** @gary.a.stafford has joined the channel  

###  _#feat-presto-connector_

  
 **@gary.a.stafford:** @gary.a.stafford has joined the channel  

###  _#troubleshooting_

  
 **@gary.a.stafford:** @gary.a.stafford has joined the channel  

###  _#segment-cold-storage_

  
 **@noahprince8:** @noahprince8 has joined the channel  
 **@noahprince8:** @noahprince8 set the channel purpose: Discussing  
**@g.kishore:** @g.kishore has joined the channel  
 **@fx19880617:** @fx19880617 has joined the channel  
 **@g.kishore:** thanks for creating this channel  
 **@noahprince8:** Feel free to add anyone as necessary. Felt like it might be
better/quicker to discuss things here than the github issue  
 **@mayanks:** @mayanks has joined the channel  
 **@jackie.jxt:** @jackie.jxt has joined the channel  
 **@noahprince8:** So, the original plan was to create a class of servers that
lazily loads segments on demand for queries. While this works, the idea of
cold storage dovetails with the idea that you’re probably going to have _many_
segments. Each segment means overhead for helix reaching an ideal state. What
if we instead create a separate class of OFFLINE table with some COLD_STORAGE
flag enabled. Cold storage tables are not part of the helix ideal state. We
explicitly accept they are going to have a lot of segments that are used
infrequently. Assignment to servers _explicitly_ happens on demand. A query
would look like this:
1. Broker prunes and finds the ideal list of segments
2. Routing table indicates particular segments are unassigned.
3. Broker somehow communicates the need for this segment to be assigned to a server.
4. Broker waits for ideal state
5. Broker runs query as usual.
6. A minion prunes cold storage segments that haven’t been used in a while  
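As a rough sketch of what the broker-side flow in steps 1-5 above could look like; every class and method name below is hypothetical, not existing Pinot API:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical broker-side helper; none of these types exist in Pinot today.
public class ColdStorageRouting {

  /** segment -> servers currently hosting it; an empty list means unassigned. */
  private final Map<String, List<String>> routingTable;

  public ColdStorageRouting(Map<String, List<String>> routingTable) {
    this.routingTable = routingTable;
  }

  public List<String> prepareSegmentsForQuery(List<String> prunedSegments) throws InterruptedException {
    // Step 2: find the pruned segments that no server currently hosts.
    List<String> unassigned = new ArrayList<>();
    for (String segment : prunedSegments) {
      List<String> servers = routingTable.get(segment);
      if (servers == null || servers.isEmpty()) {
        unassigned.add(segment);
      }
    }
    if (!unassigned.isEmpty()) {
      // Step 3: ask the controller to add these segments to the ideal state.
      requestAssignment(unassigned);
      // Step 4: block until the segments show up ONLINE, or time out.
      waitForSegmentsOnline(unassigned, Duration.ofMinutes(2));
    }
    // Step 5: caller runs the query as usual against the now-complete routing table.
    return prunedSegments;
  }

  private void requestAssignment(List<String> segments) {
    // Placeholder: would call a (hypothetical) controller endpoint that puts
    // the segments into the ideal state so servers start loading them.
  }

  private void waitForSegmentsOnline(List<String> segments, Duration timeout) throws InterruptedException {
    // Placeholder: would poll the external view / routing table until every
    // segment is served or the timeout expires.
  }
}
```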
 **@noahprince8:** The nice thing about this is that the code for the server
doesn’t need to change. As far as it’s concerned, it’s just getting assigned
normal OFFLINE segments from the deep store.  
 **@noahprince8:** This also makes it easier to evenly distribute cold storage
segments for queries across servers, potentially avoiding hotspotting if all
of the cold storage segments for a particular time range are on the same
server.  
 **@npawar:** @npawar has joined the channel  
 **@noahprince8:** This actually fits pretty well into the tiering model, with
`storageType=DEEP_STORE`  
 **@g.kishore:** how will the servers know which segments to load for a query
when it’s not in the idealstate?  
 **@noahprince8:** The broker will need to handle that. It has to assign
servers to segments if the segment has not been assigned to any servers. It
then has to wait for that state to be achieved before running the query  
 **@g.kishore:** ah, you are saying segments are still part of idealstate  
 **@noahprince8:** But only actively used segments  
 **@g.kishore:** its just that the table is not part of Helix  
 **@noahprince8:** So effectively the segments only become part of helix when
someone is trying to query them  
 **@noahprince8:** Else, they are just stored in the routing table with no
assigned servers  
 **@mayanks:** How does this solve the memory issue in broker, once a bunch of
queries that have touched all segments are executed?  
 **@g.kishore:** but all the things you have mentioned can still be achieved
by having a LazyImmutableSegmentLoader right  
 **@noahprince8:** It doesn’t solve that issue, we’d need a separate solution.
It does solve the issue of too many messages as in  
**@noahprince8:** I think for cold storage it may be reasonable to impose tier
configuration around the maximum number of segments a query is allowed to
materialize.  
 **@mayanks:** How about something like Redis Cache?  
 **@mayanks:** So we keep caching orthogonal to state management?  
 **@noahprince8:** My concern is helix managing millions of segments, 99% of
which are never touched.  
**@mayanks:** But this proposal is not addressing that problem, is it?  
**@noahprince8:** It is, you only tell helix about that segment (and to try to
assign it to a server) when someone attempts to query it  
**@mayanks:** Side note: query latency would go out the window if helix segment
assignment is happening during query execution.  
**@noahprince8:** Yeah, I think that’s acceptable for cold storage. You don’t
have to wait for the _full_ ideal state. You can start to run pieces of the
query when the segments finish downloading. Not sure how hard that would be to
achieve.  
**@noahprince8:** Even in the lazy model, you still have that problem.  
 **@mayanks:** I think there are separate issues here:
```
1. Huge Ideal State
2. Broker memory pressure
3. Storage cost on server
```
 **@noahprince8:** So
> 1. Huge Ideal State
>
> Do not include cold storage segments in the ideal state until they are
> queried. The broker will need to request segments to be added to the ideal
> state if they aren’t currently included. A cleanup minion is required.
>
> 2. Broker memory pressure
>
> Allow configuration on cold storage tiers that limits the number of segments
> that can be materialized by a single query.
>
> 3. Storage cost on server
>
> Because unused segments are not part of the ideal state, OFFLINE servers
> will not load them onto disk.  
**@mayanks:** This ^^ will also need unloading of segments from IS, broker &
server  
 **@noahprince8:** IS?  
 **@mayanks:** ideal-state  
 **@noahprince8:** Yeah, I figure you have a minion doing that  
 **@mayanks:** I see  
 **@noahprince8:** Keep it simple to start and just have a time and size based
eviction.  
 **@mayanks:** How would the selectivity of the query be?  
 **@noahprince8:** What do you mean? Like selectivity of cold storage queries?  
 **@mayanks:** Low selectivity -> large number of rows selected to process,
and vice-versa  
 **@noahprince8:** Well, for our use case high selectivity. But I imagine not
everyone’s is like that.  
 **@noahprince8:** Ideally, you want high selectivity with this. But this is why
```
Allow configuration on cold storage tiers that limits the number of segments that can be materialized by a single query
```
is important  
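To make that concrete, the per-query cap could be a one-method guard on the broker; `maxSegmentsPerQuery` as a cold-tier config knob is an assumption here, not an existing Pinot setting:

```java
import java.util.List;

// Hypothetical guard applied after segment pruning, before any cold segments
// are pulled into the ideal state.
public class ColdStorageQueryGuard {
  private final int maxSegmentsPerQuery; // assumed tier-config setting

  public ColdStorageQueryGuard(int maxSegmentsPerQuery) {
    this.maxSegmentsPerQuery = maxSegmentsPerQuery;
  }

  public void check(List<String> prunedSegments) {
    if (prunedSegments.size() > maxSegmentsPerQuery) {
      throw new IllegalStateException("Query would materialize " + prunedSegments.size()
          + " cold segments; the configured limit is " + maxSegmentsPerQuery);
    }
  }
}
```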
 **@mayanks:** Ok, that is good. So we don't have to worry about server memory
pressure during query execution.  
 **@noahprince8:** I think we can implement 1 and 3 without 2, though.
Allowing configuration to limit the number of segments used in a single query
is a separate, nice to have feature.  
 **@noahprince8:** As a proof of concept, we can tell people not to make dumb
queries :slightly_smiling_face:  
 **@noahprince8:** Really, that feature is useful on all OFFLINE tables. Not
just cold storage  
 **@mayanks:** I am wondering how much bottleneck 1 really is  
 **@mayanks:** We have a cluster with 10's of millions of segments (across
tables)  
 **@noahprince8:** It is, according to Uber and LinkedIn, a bottleneck, though
I personally do not have experience with it  
 **@mayanks:** I'd start off with 3 first, and see if 1 is really a bottleneck  
 **@g.kishore:** LinkedIn definition of bottleneck is very different  
 **@g.kishore:** we try to work on millisecond and sub second  
 **@noahprince8:** Though I think, in general, it’s a better design to manage
the caching outside of the individual server  
 **@noahprince8:** Makes it easier to avoid hotspotting  
 **@noahprince8:** It’s just a nice bonus that design also avoids message
pressure on helix  
 **@mayanks:** But LazyLoader impl is independent of server right?  
 **@noahprince8:** No, lazy loader would be on the server itself. It knows the
segments assigned to it, and only downloads them from deep store as needed  
 **@noahprince8:** Probably use something like Caffeine to handle the LRU
aspect  
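A minimal sketch of that server-side lazy loader, with Caffeine providing the size- and time-based eviction; `LoadedSegment` and the download/unload calls are placeholders, not real Pinot classes:

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.github.benmanes.caffeine.cache.RemovalCause;

// Hypothetical per-server cache that only pulls segments from deep store on
// first access and drops them again when idle or over capacity.
public class LazySegmentCache {

  /** Placeholder for whatever the server keeps loaded per segment. */
  public static class LoadedSegment {
    final String name;
    LoadedSegment(String name) { this.name = name; }
  }

  private final LoadingCache<String, LoadedSegment> cache;

  public LazySegmentCache(long maxSegments, Duration idleTtl) {
    cache = Caffeine.newBuilder()
        .maximumSize(maxSegments)   // size-based eviction
        .expireAfterAccess(idleTtl) // time-based eviction
        // When a segment is evicted, remove its local copy.
        .removalListener((String segmentName, LoadedSegment segment, RemovalCause cause)
            -> unloadFromDisk(segmentName))
        // Cache miss: fetch the segment from deep store on first query.
        .build(this::downloadFromDeepStore);
  }

  public LoadedSegment getSegment(String segmentName) {
    return cache.get(segmentName);
  }

  private LoadedSegment downloadFromDeepStore(String segmentName) {
    // Placeholder: download + untar the segment from deep store and load it.
    return new LoadedSegment(segmentName);
  }

  private void unloadFromDisk(String segmentName) {
    // Placeholder: delete the local segment directory.
  }
}
```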
 **@mayanks:** Correct. And this way we have taken out Helix from the business
of sending more messages to load/unload?  
 **@g.kishore:** lets start with that  
 **@g.kishore:** LazyLoader  
 **@g.kishore:** I have some ideas w.r.t Helix but lets tackle the lazyloading
first  
 **@g.kishore:** Note that we can also enhance Helix  
**@mayanks:** @noahprince8 in case you were not aware, @g.kishore is the
author of Helix :grinning:  
**@noahprince8:** Hahah yeah, I know, he’s got an impressive resume. It’s
honestly very cool to be having these conversations. Especially given a year
ago I was doing web dev :smile:  
**@mayanks:** Very impressive transition from web dev :clap:  
**@noahprince8:** Data infra is where it’s at. Got tired of transpiling things
70 times :slightly_smiling_face: . The two fields are kind of similar on the
backend, though. When you get into multiple dbs, microservices, job queues,
load balancers, event driven systems. Wasn’t a huge setback switching fields,
luckily  
**@noahprince8:** But yeah, architecture discussions with the titans of the
data infra industry, on the bleeding edge. Pretty damn cool.  
**@mayanks:** :heavy_plus_sign:  
 **@noahprince8:** Well, my concern is that if we pull this lazy loading away
from the server, these code changes aren’t really necessary  
 **@noahprince8:** No server change necessary if you use cluster state to
manage lazy loading  
 **@noahprince8:** And really, which segments are lazy loaded _is_ an
important part of the cluster state. It’s useful from a monitoring
perspective.  
**@mayanks:** Good point  
 **@g.kishore:** I see  
 **@g.kishore:** let me think about that idea  
 **@noahprince8:** There are also practical aspects. Like the UI will force
materialization of segments because it wants to show all of the segments
assigned to the server  
 **@noahprince8:** So I’d need to “hack” that to only show the ones that have
actually materialized.  
 **@npawar:** the unloading could also be done by the same periodic task which
is responsible for moving from tier to tier (assuming you are planning to use
the tiered storage design). Based on some flag in the tier config, for
storageType=deepStore  
**@noahprince8:** Cache unloading?  
**@npawar:** unloading from ideal state  
**@npawar:** instead of minions  
**@noahprince8:** Ah yeah. Is that a scheduled task or long running?  
**@mayanks:** Scheduled  
**@noahprince8:** Would love for some way to listen to an event bus of hit
segments. Because something like  would be super nice for something like this.  
**@noahprince8:** And really, making cache eviction configurable. Because for
specific use cases, you may want an intelligent caching strategy.  
 **@npawar:** will help in reducing 1 component  
 **@g.kishore:** lets do a zoom call?  
 **@noahprince8:** I’m down. When?  
 **@g.kishore:** 3 pm pst?  
 **@noahprince8:** Earlier might be better, if you’re available  
 **@g.kishore:** 1:30 pm pst  
 **@noahprince8:** Works for me  
 **@mayanks:** Works for me too  
 **@mayanks:** Could you guys send me your email id's to send the invite to?  
**@noahprince8:**  
**@mayanks:** @npawar?  
**@npawar:**  
**@mayanks:** Sent invite for 1:30-2:00pm today  
**@jackie.jxt:** @mayanks Can you please also add me to the invite:  
**@mayanks:** Done  
**@jackie.jxt:** Thanks  