Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2020/11/14 02:00:16 UTC

Apache Pinot Daily Email Digest (2020-11-13)

### _#general_

  
 **@pabraham.usa:** anyone got a performance comparison of storing indexes on HDD / SSD / EFS / NFS? As usual I assume SSD will be the top performer?  
**@mayanks:** Two observations from what I have seen. If the data fits in memory, HDD and SSD performance is similar in terms of query latency. Once data starts spilling over to disk, SSD can make a lot of difference. For one use case we were able to push it to full read-throughput capacity and still get ms latency.  
**@mayanks:** As expected of course, but good to establish that in practice as
well  
**@mayanks:** We typically use NFS as a deep store (which is not in the read path), and not as an attached disk for serving nodes  
**@ssubrama:** @pabraham.usa one more to add to Mayank's observation. If you
have HDD, be prepared to take a latency hit when new segments are loaded. Even
if segments are replaced _in situ_ and both of them fit in memory, we have
seen high latency during the replacement time.  
**@ssubrama:** Of course, it also depends on how much data you have, the kind
of queries you run, etc.  
**@mayanks:** Yep, HDDs have a cold start problem, especially when all data is
being refreshed  

### _#troubleshooting_

  
 **@noahprince8:** for tagging a server, do you just have to create the server and then manually tag it afterward?  
**@g.kishore:** yes  
**@noahprince8:** So how does one achieve autoscaling if you have to manually
tag new servers?  
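
For reference, re-tagging an instance is a single controller REST call, which is what an autoscaling hook would typically invoke once a new server has registered. Below is a minimal Java sketch; the endpoint path `/instances/{instanceName}/updateTags`, the tag values, and the instance name used here are assumptions, so verify against the controller's Swagger UI for your Pinot version.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch: re-tag a newly started server so it joins the desired tenant.
// The endpoint path and parameter names are assumptions; check the controller Swagger UI.
public class RetagServer {
  public static void main(String[] args) throws Exception {
    String controller = "http://localhost:9000";          // controller address (assumed)
    String instanceName = "Server_pinot-server-3_8098";   // hypothetical instance name
    String tags = URLEncoder.encode("myTenant_OFFLINE,myTenant_REALTIME", StandardCharsets.UTF_8);

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(controller + "/instances/" + instanceName + "/updateTags?tags=" + tags))
        .PUT(HttpRequest.BodyPublishers.noBody())
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```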
 **@ken:** I was running Pinot locally, and wanted to reload some revised
data. So I first deleted all segments for the target table via `curl -v ""
-XDELETE`, which seemed to work, then re-ran my import job via `bin/pinot-
admin.sh LaunchDataIngestionJob -jobSpecFile <spec file>`, which also seemed
to work. But when I run a query, I get ```ProcessingException(errorCode: 450,
message: InternalError: java.net.SocketException: Host is down(connect
failed)```  
**@mayanks:** Hey @ken what do you see in the external view and ideal state?  
**@ken:** Via Zookeeper inspection, or some other way of looking at that data?
Sorry, just started working with Pinot…  
**@mayanks:** Yes ZK inspector  
**@ken:** for external view - brokerResource, table_OFFLINE, or
leadControllerResource?  
**@ken:** for crawldata_OFFLINE, the mapFields{} has every segment state as
“ERROR”  
**@ken:** I can dump JSON here if that’s what you’d prefer, let me know -
thanks!  
**@ken:** Sadly, I’ve restarted the services so I’m worried that the state I’m
seeing now isn’t all that helpful to you. What’s the best way to reset
everything to fresh, and then try to recreate my problem?  
**@ken:** (this is for running Pinot locally, starting up each service via
Bash)  
**@mayanks:** You can nuke all the JVMs if this is local and you want to start with a clean state  
**@mayanks:** If your data size is large, perhaps the server is OOM’ing?  
**@mayanks:** Segments go in error state when server cannot host them  
**@mayanks:** Server log usually tells why  
**@ken:** Hmm, if I do that and relaunch Zookeeper, I still see my previous
table data in the inspector  
**@ken:** Should I also delete the “PinotCluster” node in ZK?  
**@mayanks:** Yes  
**@ken:** So if OOM is the issue, bumping up server JVM (not
broker/controller) is what I’d need to do, right?  
**@mayanks:** Correct. What’s your total data size? If you load segments
MMAPed (table config) the OOM shouldn’t happen  
**@ken:** Table config says `"loadMode": "MMAP",`  
**@mayanks:** Ok  
**@ken:** Total segments *.tar.gz file sizes == 45MB, I had 2GB for server,
bumped to 4GB  
**@ken:** I restarted everything (ZK first, deleted the PinotCluster node, then everything else) and reimported. Seems to be OK now. I could try deleting all segments and re-importing again, just to see if that puts me into a weird state.  
**@ken:** Though I’m happy now trying to craft queries against real data
:slightly_smiling_face:  
**@mayanks:** :+1:  
**@mayanks:** Although, with MMAP, loading of segments should not OOM. Query
processing happens on heap though  
 **@ken:** And if I check the logs, I see various errors logged by the server
(and ZooKeeper). So is there a different approach I should have used to get
rid of all of the old data before doing the re-import? This is all for offline
data.  

###  _#time-based-segment-pruner_

  
 **@snlee:** let's get consensus on the time-based pruner approach. When I talked with @g.kishore and @noahprince8, they had concerns with the O(n) naive approach, so @jiapengtao0 did some research and has an implementation using an interval tree. So I see the following options: 1. the naive approach; 2. keep 2 sorted lists (one on start time, the other on end time), use binary search, and intersect the results; 3. an interval tree  
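
For context, the naive approach discussed here is a linear scan of every segment's time interval against the query's time range. A minimal Java sketch of that O(n) check, using hypothetical names rather than the actual Pinot classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Naive time-based pruning: scan every segment and keep the ones whose
// [start, end] interval overlaps the query's time range. O(n) per query.
public final class NaiveTimePruner {
  public record Interval(long start, long end) {}   // segment time range, inclusive

  public static List<String> prune(Map<String, Interval> segmentIntervals,
                                   long queryStart, long queryEnd) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, Interval> entry : segmentIntervals.entrySet()) {
      Interval interval = entry.getValue();
      // Two intervals overlap iff each one starts no later than the other one ends.
      if (interval.start() <= queryEnd && interval.end() >= queryStart) {
        selected.add(entry.getKey());
      }
    }
    return selected;
  }
}
```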
 **@snlee:** @jackie.jxt had some concerns about memory usage with the interval tree, and he recommends starting with the naive approach  
 **@snlee:** @steotia  
**@steotia:** @steotia has joined the channel  
 **@mayanks:** Can we make the decision data driven?  
 **@snlee:** @jiapengtao0 can u point to the experiments that u did?  
 **@g.kishore:** Why is interval tree memory intensive?  
 **@jackie.jxt:** We need to use segment tree instead of interval tree  
 **@jiatao:** One second, let me rerun the simple benchmark.  
 **@jackie.jxt:** We should not get into a state where we need to manage
millions of (logical) segments in one table  
 **@jackie.jxt:** Everything will break, not just this pruner  
 **@jiatao:** Why segment tree instead of interval tree?  
 **@jackie.jxt:** Interval tree is basically the same as sorted list + binary
search  
 **@jackie.jxt:** You need to maintain 2 of them to handle the search for both
start and end  
 **@jiatao:** It's an augmented binary search tree, and the key is the time interval, so we don't need 2 of them.  
 **@jackie.jxt:** When you sort the intervals, you can only sort primarily on one of start or end  
 **@jackie.jxt:** If you sort on start first, you cannot handle the search against the query start time, because the end times are not sorted in the tree  
 **@steotia:** isn't the problem about finding segments with overlapping intervals? So given each interval from the filter tree, look for the overlapping intervals and the corresponding segments. An interval tree sounds like a candidate  
 **@steotia:** segment tree is better when you want to do point queries within an interval  
 **@steotia:** segment tree is not your traditional balanced BST physically..
it is an array  
 **@steotia:** so not sure how we can use it to find overlapping intervals  
 **@jiatao:** The interval search tree is introduced in this video.  
 **@jiatao:** I think there are some interval tree variants with quite similar names, so there may be confusion.  
 **@jackie.jxt:** Let me learn about it. Based on the wiki, it can handle our requirement:  
**@steotia:** both segment and interval tree store intervals... I think the
difference is what you want to query... in this case we want to query
overlapping intervals.. which segment tree can't answer  
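
To make the data structure under discussion concrete: an interval search tree is a BST keyed on interval start, where each node is augmented with the maximum end time in its subtree, so a single tree can answer overlap queries against both ends of the query range. A minimal, unbalanced Java sketch under those assumptions (not the actual Pinot implementation; a production version would keep the tree balanced):

```java
import java.util.ArrayList;
import java.util.List;

// Interval search tree sketch: a BST keyed on interval start time, where every
// node also stores the max end time found in its subtree. A query skips any
// subtree whose max end time cannot overlap the query interval.
public class IntervalTree {
  private static final class Node {
    final long start, end;   // segment interval
    final String segment;    // segment name
    long maxEnd;             // max end time in this subtree
    Node left, right;
    Node(long start, long end, String segment) {
      this.start = start; this.end = end; this.segment = segment; this.maxEnd = end;
    }
  }

  private Node root;

  public void insert(long start, long end, String segment) {
    root = insert(root, new Node(start, end, segment));
  }

  private Node insert(Node node, Node toAdd) {
    if (node == null) return toAdd;
    if (toAdd.start < node.start) node.left = insert(node.left, toAdd);
    else node.right = insert(node.right, toAdd);
    node.maxEnd = Math.max(node.maxEnd, toAdd.end);
    return node;
  }

  /** Collects all segments whose interval overlaps [queryStart, queryEnd]. */
  public List<String> query(long queryStart, long queryEnd) {
    List<String> result = new ArrayList<>();
    query(root, queryStart, queryEnd, result);
    return result;
  }

  private void query(Node node, long queryStart, long queryEnd, List<String> result) {
    if (node == null || node.maxEnd < queryStart) return;        // nothing here can overlap
    query(node.left, queryStart, queryEnd, result);              // left may still overlap
    if (node.start <= queryEnd && node.end >= queryStart) result.add(node.segment);
    if (node.start <= queryEnd) query(node.right, queryStart, queryEnd, result);
  }
}
```

Because the start keys order the tree and the subtree-wide max end covers the other bound, one tree handles both overlap conditions, which is the point made above about not needing two sorted lists.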
 **@jackie.jxt:** Before this, I want to discuss if we want to handle the
corner case of ZK failures  
 **@jackie.jxt:** Currently, if some segment ZK metadata is missing due to
temporary ZK failures, we still route the segments. But that is not possible
with this approach  
 **@jiatao:** Why?  
 **@jiatao:** If the ZK metadata is missing, we can set the interval to the default [0, Long.MAX_VALUE]  
 **@jiatao:** So it won't get filtered out.  
 **@jackie.jxt:** If the ZK metadata is missing, you won't know the existence
of the segment, thus not able to put this default  
 **@jackie.jxt:** Oh, I take it back; you initialize the ZK metadata based on the online segments  
 **@jackie.jxt:** After reading the wiki, now I understand the algorithm. We definitely need more comments explaining the algorithm, or a pointer to the wiki  
 **@jackie.jxt:** Do you have the perf number comparing this approach with the
naive one?  
 **@jiatao:** Yeah, I'm running it to print the result, maybe 3 more minutes  
 **@noahprince8:** > We should not get into a state where we need to manage millions of (logical) segments in one table  
What else would break down at millions of segments?  
 **@jackie.jxt:** ZK, cluster management, segment size check etc  
 **@jackie.jxt:** I think the solution should be grouping multiple physical segments into one logical segment, and only storing ZK metadata for the logical segment, as @g.kishore suggested  
**@noahprince8:** Yeah. May also be trying to make Pinot something it's not. Pinot is very good at analyzing terabytes of time series data. It might be a square peg in a round hole trying to make it support petabytes. Perhaps traditional parquet files in s3 are better for 3+ month old data. Or something like that. I still think the segment cold storage and the time based pruners are nice features to have just for the cost savings. Probably most of our tables can fit into that paradigm.  
**@jackie.jxt:** My concern with the cold storage feature is that one query that hits the old data could invalidate everything in the server cache. Just downloading all the old segments could take hours, and I don't think Pinot is designed for queries of such high latency.  
**@jackie.jxt:** Ideally, we should abstract the segment spi so that the query
engine can directly work on parquet files stored in the deep storage without
downloading the whole files  
**@noahprince8:** Yeah, I think 1) having limits on the number of segments you
can query and 2) pruning aggressively helps.  
**@noahprince8:** I agree, the ideal would be having it work directly on parquet files. Though at what point are you better off just using Presto with a UNION ALL query with a `WHERE date < 3 months` on one side and `WHERE date > 3 months` on the other side.  
**@g.kishore:** This is not really terabytes vs petabytes, it's more about metadata scale  
**@g.kishore:** And specifically for one table  
**@g.kishore:** If you use Presto, a simple file listing will have the same
problem  
**@g.kishore:** In the end, it’s a file list + filtering  
**@noahprince8:** I think it's both, really. The large amount of data itself becomes a problem because you can't store it on normal hard disks. It needs to be in a distributed fs. With data that large, if your segments are 200-500 MB, the metadata scale also becomes a problem.  
**@g.kishore:** We talked about this rt  
**@g.kishore:** With a segment per day  
**@g.kishore:** We are really looking at 3600 segments for 10 years  
**@noahprince8:** So you can either do the segment group, or just merge the
segments. But either way you do have a long download time.  
**@noahprince8:** Now, if you’re lucky the segment group itself causes a hit
to a separate metadata API which narrows down the number of smaller segments
you need to download.  
**@g.kishore:** Yeah, let's get there to know what the bottleneck is  
**@noahprince8:** But again I keep asking myself if this isn't trying to force Pinot into something it's not. Are you better off having a bespoke solution for long-term historical data that involves an aggressively indexed metadata store of files in s3, and a query engine that reads directly from s3 (like most parquet implementations)?  
 **@jackie.jxt:** Broker will prune based on logical segments, and server can
map logical segments into physical segments  
 **@jiatao:** The intervalMP is the naive method; the performance ratio here means how much faster the search-tree method is compared with the naive one. In the first experiment I fixed the percentage of resulting segments and increased the # of segments/table. ```
100 segments/table: intervalMP-> 0.0374 ms, intervalST-> 0.00539 ms, 6.9 performance ratio, average segments/table: 100.0, average result size: 8.1, result size percentage 8.1
1000 segments/table: intervalMP-> 0.0767 ms, intervalST-> 0.0093 ms, 8.23 performance ratio, average segments/table: 1000.0, average result size: 52.0, result size percentage 5.2
10000 segments/table: intervalMP-> 0.358 ms, intervalST-> 0.056 ms, 6.3 performance ratio, average segments/table: 10000.0, average result size: 501.3, result size percentage 5
100000 segments/table: intervalMP-> 6.16 ms, intervalST-> 0.86 ms, 7.15 performance ratio, average segments/table: 100000.0, average result size: 5001.0, result size percentage 5
10000000 segments/table: intervalMP-> 636.1 ms, intervalST-> 230.5 ms, 2.75 performance ratio, average segments/table: 1.0E7, average result size: 500001.6, result size percentage 5
```  
 **@jiatao:** In the second experiment, I fixed the # of segments/table and decreased the percentage of resulting segments. ```
10000000 segments/table: intervalMP-> 1237.3 ms, intervalST-> 432.7 ms, 2.86 performance ratio, average segments/table: 1.0E7, average result size: 5000001.1, result size percentage 50.0
10000000 segments/table: intervalMP-> 697.3 ms, intervalST-> 224.6 ms, 3.1 performance ratio, average segments/table: 1.0E7, average result size: 500001.3, result size percentage 5.0
10000000 segments/table: intervalMP-> 593.4 ms, intervalST-> 18.97 ms, 31.27 performance ratio, average segments/table: 1.0E7, average result size: 50001.0, result size percentage 0.5
10000000 segments/table: intervalMP-> 585.3 ms, intervalST-> 3.18 ms, 183.97 performance ratio, average segments/table: 1.0E7, average result size: 5002.1, result size percentage 0.05
10000000 segments/table: intervalMP-> 582.2 ms, intervalST-> 0.48 ms, 1203.15 performance ratio, average segments/table: 1.0E7, average result size: 501.3, result size percentage 0.0050
```  
 **@jiatao:** If the resulting # of segments is much smaller than the # of all segments, the interval search tree performs much better.  
 **@g.kishore:** great to see this level of detail :clap::clap:  
 **@jackie.jxt:** ```10000000 segments/table: intervalMP-> 1237.3 ms, intervalST-> 432.7 ms, 2.86 performance ratio, average segments/table: 1.0E7, average result size: 5000001.1, result size percentage 50.0``` Really? When selecting 50% of the segments, the naive one takes 2.86 times as long as the tree solution?  
 **@jiatao:** I used a ConcurrentHashMap for the naive solution  
**@jiatao:** So it may not perform as well as a normal HashMap.  
 **@jackie.jxt:** Anyway, I think if we select fewer segments, the tree solution is definitely much faster  
 **@jackie.jxt:** Good job  
 **@jackie.jxt:** Based on this experiment, in order to handle a large number of segments, we might also want to optimize the partition-based pruner to pre-group the segments under each partition  
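
A minimal Java sketch of that pre-grouping idea, with hypothetical names: maintain the partition-to-segments map when segments are added or removed, so the per-query work is a lookup over the matching partitions instead of a scan over all segments.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Pre-group segments by partition id outside the query path, so per-query
// pruning is a map lookup over the query's partitions rather than a full scan.
public class PartitionPrunedRouting {
  private final Map<Integer, List<String>> segmentsByPartition = new HashMap<>();

  /** Called when the set of online segments changes (not on the query path). */
  public void onSegmentAdded(String segmentName, int partitionId) {
    segmentsByPartition.computeIfAbsent(partitionId, k -> new ArrayList<>()).add(segmentName);
  }

  /** Query path: only the partitions matching the filter are consulted. */
  public List<String> selectSegments(Set<Integer> matchingPartitions) {
    List<String> selected = new ArrayList<>();
    for (int partitionId : matchingPartitions) {
      selected.addAll(segmentsByPartition.getOrDefault(partitionId, List.of()));
    }
    return selected;
  }
}
```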