Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2020/10/30 02:00:45 UTC

Apache Pinot Daily Email Digest (2020-10-29)

### _#general_

  
 **@xmuyoo:** @xmuyoo has joined the channel  
 **@mail460:** @mail460 has joined the channel  
 **@g.kishore:** Passcode: 541532  
 **@g.kishore:** meetup happening now ^^  
 **@azhar93:** @azhar93 has joined the channel  
 **@ravibabu.chikkam:** is this meeting only for uber employees?  
**@mayanks:** No  
**@mayanks:** You are welcome to join  
**@ravibabu.chikkam:** the link is not working  
**@ravibabu.chikkam:** asking for sign in  
**@mayanks:** Hmm, perhaps it is  
**@mayanks:** Could you try signing in there, and join the event from there
(may have to rsvp)  
**@mayanks:**  
 **@sree_at_chess:** @sree_at_chess has joined the channel  
 **@sree_at_chess:** Hi  
 **@gary.a.stafford:** @gary.a.stafford has joined the channel  

###  _#random_

  
 **@xmuyoo:** @xmuyoo has joined the channel  
 **@mail460:** @mail460 has joined the channel  
 **@azhar93:** @azhar93 has joined the channel  
 **@sree_at_chess:** @sree_at_chess has joined the channel  
 **@gary.a.stafford:** @gary.a.stafford has joined the channel  

###  _#feat-presto-connector_

  
 **@nguyenhoanglam1990:** @nguyenhoanglam1990 has joined the channel  

###  _#troubleshooting_

  
 **@elon.azoulay:** Another migration question: our Kafka cluster will be moving and offsets will be reset. How can we ensure that Pinot keeps ingesting? Is there a way to do this with no downtime?  
 **@elon.azoulay:** The reset strategy is "earliest", but is there an API call to tell Pinot to reset realtime ingestion?  
**@ssubrama:** If offsets are reset I see no way to recover the realtime table. You will need to drop the realtime table and recreate it. At some point, the consumers will just be waiting for events and see no more.  
**@ssubrama:** Offsets are expected to increase monotonically (not necessarily
sequentially)  
**@elon.azoulay:** Got it, this is really good to know!  
 **@mayanks:** How many tables do you have in your cluster?  
 **@g.kishore:** do you have offline table?  
 **@elon.azoulay:** Yes  
 **@elon.azoulay:** We have ~15 tables, all hybrid except 2 that are realtime
only  
 **@elon.azoulay:** This is staging, so we can test it out  
 **@elon.azoulay:** I will see if we can set the offsets on the new cluster.  
 **@elon.azoulay:** Is it possible to update a realtime table def to change the Kafka broker URL with no issues?  
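
(That question is left open in the thread. For reference, a hedged sketch of how a streamConfigs change is typically submitted, assuming the controller's `PUT /tables/{tableName}` endpoint and the standard `stream.kafka.broker.list` key; whether a live consuming table picks the change up cleanly is exactly what is being asked above.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class UpdateRealtimeTableConfig {
  public static void main(String[] args) throws Exception {
    // Hypothetical controller address and table name. The JSON file is the full
    // realtime table config, with streamConfigs."stream.kafka.broker.list"
    // pointing at the new Kafka cluster.
    String controllerUrl = "http://pinot-controller:9000";
    String tableName = "myTable";
    String tableConfigJson = Files.readString(Path.of("myTable_realtime.json"));

    // Re-submit the table config to the controller's table-update endpoint.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(controllerUrl + "/tables/" + tableName))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(tableConfigJson))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```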

###  _#time-based-segment-pruner_

  
 **@snlee:** @snlee has joined the channel  
 **@g.kishore:** @g.kishore has joined the channel  
 **@jiapengtao0:** @jiapengtao0 has joined the channel  
 **@noahprince8:** @noahprince8 has joined the channel  
 **@mayanks:** @mayanks has joined the channel  
 **@snlee:** hello  
 **@snlee:** i created this ticket to discuss the plan for the time-based segment pruner  
 **@noahprince8:** So, a bit about our use case for this. We have a little over a petabyte of data, while Pinot seems most commonly used with < 100 TB. At that scale, you’re looking at millions of segments, and a large EBS volume bill. So part 1 of that is lazily fetching segments. Part 2 is ensuring that the broker prunes segments aggressively to minimize the number of segments lazily fetched on deep historical servers.  
 **@snlee:** So parts 1 & 2 can happen in parallel  
 **@snlee:** So to update on our side:
• We are working on evaluating 2 interval search algorithms for the pruner (1. O(N) naive for-each loop, 2. O(m*logN) using an interval search tree)
• Once the study is done, we will start to implement the broker-side pruner
• For ETA, I will discuss with @jiapengtao0 and update  
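
(For readers following along: a minimal sketch of what the two candidate algorithms above choose between. The `SegmentInterval` record, segment names, and time units are illustrative, not Pinot's actual classes.)

```java
import java.util.ArrayList;
import java.util.List;

public class TimePrunerSketch {
  // Illustrative per-segment time metadata the broker would keep in memory.
  record SegmentInterval(String segmentName, long startMillis, long endMillis) {}

  // Option 1: O(N) naive for-each loop -- test every segment for overlap
  // with the query's time window.
  static List<String> pruneNaive(List<SegmentInterval> segments, long queryStart, long queryEnd) {
    List<String> selected = new ArrayList<>();
    for (SegmentInterval s : segments) {
      boolean overlaps = s.startMillis() <= queryEnd && s.endMillis() >= queryStart;
      if (overlaps) {
        selected.add(s.segmentName());
      }
    }
    return selected;
  }

  // Option 2 (interval search tree) uses the same overlap test, but stores the
  // intervals in a balanced tree augmented with subtree max end times, so a
  // query matching m of N segments costs roughly O(m * log N) instead of O(N).

  public static void main(String[] args) {
    List<SegmentInterval> segments = List.of(
        new SegmentInterval("seg_0", 0, 999),
        new SegmentInterval("seg_1", 1000, 1999),
        new SegmentInterval("seg_2", 2000, 2999));
    System.out.println(pruneNaive(segments, 1500, 2500)); // prints [seg_1, seg_2]
  }
}
```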
 **@noahprince8:** From what I understand, segments are currently stored in
Zookeeper. Not sure if zookeeper supports a time index, or can scale well at
millions of segments.  
 **@noahprince8:** If you’re building a search tree from the segments in
zookeeper, presumably you need to have all of the segments in memory  
 **@noahprince8:** That’s going to cause issues as segments scale  
 **@snlee:** @noahprince8 So, we fetch the segment metadata on `external view change`. This happens only if new segments are added or deleted (there are some other cases, but that’s the happy path…).
1. We periodically fetch the segment metadata and update the info in memory <- this part can be expensive, but it only happens a few times per day.
2. We use the in-memory info for segment pruning.  
 **@snlee:** so, pinot’s scalability will be bounded by the information that we need to store in memory.  
 **@snlee:** if you have billions of segments, then we will eventually run out
of memory.. but for millions, i think it’s doable.  
 **@noahprince8:** If you have a 500 TB table with 50MB segments, that would be 10,000,000 segments. I think that could cause OOMs in the broker holding this search tree. I’m also a little unfamiliar with the way Helix’s message passing works. But I imagine state changes might also become an issue at this scale.  
 **@noahprince8:** Yeah, I’m curious what the memory footprint is of a million
segments. We could also up the segment size to lower the count. Or potentially
merge very old segments.  
 **@noahprince8:** But I’m wondering if we could instead delegate segment storage to something like influxdb with a time series index. It’s custom-built to hold this type of data in memory as long as it can, and flush to disk when it needs to  
 **@noahprince8:** Or… honestly, what if we just make segment storage pluggable? Because for most use cases the in-memory index will do. Then my company could write a custom plugin that offloads into influx for larger tables.  
 **@snlee:** I think that Iceberg from Netflix is also doing a similar thing  
 **@g.kishore:** I don’t really see why influxdb would be any better here  
 **@noahprince8:** Yeah. They do it with files stored in s3 I believe.
Snowflake also does a similar kind of indexing of segments, and they use
foundationdb  
 **@noahprince8:** @g.kishore influx was just an example, I haven’t actually used it. But generally something that can effectively index/query time-indexed data, and that uses a hybrid memory-caching and disk-based approach so we know it will scale with the volume.  
 **@snlee:** Pinot is also one type of storage + query engine solution for indexing/querying time series data.  
 **@noahprince8:** Because Pinot is performing pretty well with a couple days
of data. My main concern is that it will fall over when we introduce some
really large data sets. And maybe Pinot isn’t the best solution for these
datasets. I could be trying to shove a square peg in a round hole here.  
 **@noahprince8:** Ha using pinot to index pinot segments for large tables.
That would be funny.  
 **@noahprince8:** Could even make a tree of pinots if your segment index gets
too big so it needs its own segment index :smile:  
 **@snlee:** So do you need to serve one giant table? Or will you have many of these?  
 **@noahprince8:** Many of these  
 **@snlee:** I think that we can roll out the feature in steps. In any case, we will make this pruner configurable per table.  
 **@noahprince8:** How are segments added? If you could have a pluggable
segment adder and pruner that'd be cool  
 **@snlee:** adder?  
 **@snlee:** pruner is pluggable yes  
 **@noahprince8:** Yeah. I'm a bit unfamiliar with the architecture. But I
assume there's an event bus with new segments  
 **@snlee:** ohoh  
 **@snlee:** so we set a ZK watcher from the broker  
 **@noahprince8:** So if you could make it completely detachable so that you
can abstract segment storage and querying  
 **@snlee:** and the broker will be notified if there’s any update on the segments for the table  
 **@snlee:** Our deep storage (or segment store) has been abstracted out.  
 **@snlee:** but our query execution is closely mingled with mmapped segments placed locally. At the Pinot server level, there’s no store & querying separation.  
 **@snlee:** yeah we can abstract out the segments instead of requiring them to be placed locally. but that will be a huge architectural change. also, i’m not sure if it’s possible to do mmap to a remote file  
 **@noahprince8:** Ah. That makes sense  
 **@snlee:** we can keep this as a separate discussion. I think that it’s an interesting topic since eventually, separated store & execution will fit better in a cloud environment (if it’s possible to achieve for pinot without perf degradation)  
 **@snlee:** so, for supporting your use case,  
 **@snlee:** So, for supporting your use case:
1. I think that you can use a larger segment size. A 500MB segment size will work pretty well. That will reduce the # of segments by a factor of 10.
2. We will first check in the in-memory time-based pruner. If you can verify it against your use case, that would be great.  
 **@noahprince8:** Yeah, I’ll definitely test it with a few months of a large
dataset  
 **@snlee:** sounds good  
 **@snlee:** we will update here when we have an update on the PR.  
 **@noahprince8:** Yeah, so for a rough idea of scale: our largest dataset is around 3 TB a day (taken from our Kafka input rate on a busy day; this is a compressed binary protocol). A mid-size one is around 300 GB a day. In total we’re looking at around 10 TB a day. The 3 TB dataset can be made more efficient to make it slightly smaller. But with 252 trading days in a year, that big dataset is going to be 756TB. So 12 million segments. Is it realistic to be trying to store something this large in Pinot? Or should we be using traditional data warehousing techniques like Hive, Iceberg files, etc.?  
 **@mayanks:** Would you be able to provide a bit more info on the use case
(eg content of data, what you want to query)?  
 **@noahprince8:** Yeah, so this is full depth of the books in an options exchange. It blows up pretty quickly because for every exchange, there are multiple underlying contracts. For each one of those, many futures at various expirations. For each future, multiple options contracts at various strike prices. For each one of those, there are multiple levels of bid price/quantity and ask price/quantity listed from best to worst offer.  
 **@mayanks:** Ok. What kind of queries would you run?  
 **@noahprince8:** Plot the top of book for this particular options contract over this 2-minute interval. Give me the average top of book for some contract over some time interval. Join this to some other dataset, like our theoretical price of an option, and compare.  
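
(A hypothetical query shape for the "average top of book" example above, just to make the access pattern concrete; the table and column names `book_events`, `contractId`, `eventTimeMillis`, and `bestBidPrice` are invented for illustration.)

```java
public class ExampleQueries {
  public static void main(String[] args) {
    // "Average top of book for some contract over some time interval":
    // a single-contract filter plus a time-range predicate, which is the
    // shape that partition pruning and time-based segment pruning target.
    String averageTopOfBook =
        "SELECT AVG(bestBidPrice) FROM book_events "
            + "WHERE contractId = 'SPX-20201218-C3400' "
            + "AND eventTimeMillis BETWEEN 1603929600000 AND 1603929720000"; // a 2-minute window
    System.out.println(averageTopOfBook);
  }
}
```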
 **@mayanks:** so are queries limited to a particular contract, or can they span across contracts?  
 **@noahprince8:** Most likely limited to a particular contract. Though
ultimately I’m not sure how quants will use it. They may query multiple
options for the same future. Or multiple options for the same underlying
stock.  
 **@noahprince8:** Which is why I’m very interested in the bloom filtering.
Because a bloom filter on underlying and expiry + an efficient time filter
could cut 12 million segments down to 2  
 **@mayanks:** For queries limited to particular contract, partitioning on
contract would prune all other contracts  
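
(A hypothetical sketch of the two pruning mechanisms discussed here, partition metadata plus a per-segment bloom filter, using Guava's `BloomFilter` for illustration; the bookkeeping and column semantics are invented, not Pinot internals.)

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SegmentSkipSketch {
  // Per-segment bookkeeping the pruners would consult: the partition id derived
  // from the partitioning column (contract), and a bloom filter over the
  // `underlying` values present in the segment.
  record SegmentMeta(String name, int partition, BloomFilter<CharSequence> underlyingFilter) {}

  static List<String> candidateSegments(List<SegmentMeta> segments, String contract,
      String underlying, int numPartitions) {
    int queryPartition = Math.floorMod(contract.hashCode(), numPartitions);
    List<String> selected = new ArrayList<>();
    for (SegmentMeta s : segments) {
      if (s.partition() != queryPartition) {
        continue; // pruned: segment holds a different contract partition
      }
      if (!s.underlyingFilter().mightContain(underlying)) {
        continue; // pruned: bloom filter says this underlying is definitely absent
      }
      selected.add(s.name());
    }
    return selected;
  }

  public static void main(String[] args) {
    int numPartitions = 64;
    String contract = "SPX-20201218-C3400";
    BloomFilter<CharSequence> filter = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);
    filter.put("SPX");
    SegmentMeta seg = new SegmentMeta(
        "seg_0", Math.floorMod(contract.hashCode(), numPartitions), filter);
    System.out.println(candidateSegments(List.of(seg), contract, "SPX", numPartitions));
  }
}
```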
 **@mayanks:** Is your 3TB data size translating to 3TB of Pinot segments, or
do you see compression?  
 **@mayanks:** We have cases with 100MB - 1GB segment size (per segment), so
definitely increasing the segment size will help reduce broker memory usage  
 **@noahprince8:** I suspect you’d see some compression. I have not put that
dataset into pinot, yet.  
 **@mayanks:** Ok, would be good to pick one file and see what the compression
is  
 **@noahprince8:** Yeah. More so I want to prepare for millions of segments so
we don’t go all in on pinot and hit scaling issues later.  
 **@noahprince8:** So we know something like Iceberg files will probably work, but will be slower than Pinot. Doing due diligence now to ferret out everywhere Pinot might crack at scale. As is typical in data infra though, there’s a limited amount we can tell without actually _testing_ it. Let’s see how it works with the in-memory merge tree. I’ll work on the lazy loading from S3. Then I can spin up a testing cluster in AWS and backfill a few months of data.  
 **@jiatao:** @jiatao has joined the channel  
 **@snlee:** cool  
 **@noahprince8:** Sorry for being annoying with this haha. Will try to make
up for it by contributing some code :smile:  
 **@snlee:** :+1:  
 **@snlee:** imo, if you can merge your files to ~500MB to reduce the # of segments to single-digit millions, in-memory pruning should be fine.  
 **@noahprince8:** Yeah, I think also we can create different tenants for the
super large tables  
 **@noahprince8:** Then use presto or that rest thing that uber made to unify
them  
 **@noahprince8:** We could also introduce another layer of tiering so that
after x months, we create huge day-long segments.  