Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2020/08/11 02:00:11 UTC

Apache Pinot Daily Email Digest (2020-08-10)

#general

@katneniravikiran: Hi All,
I am new to Pinot. I am trying to ingest/upload a 7GB TPC-H lineitem table into Pinot, and the entire file is getting uploaded as a single segment.
Does Pinot support any configuration to specify a segmentation column (a column based on which segments get created from the ingested file/data)?
When I explicitly split the file into multiple files, multiple segments get created. Does Pinot expect pre-segmented data to be ingested/uploaded?

@jeevireddy: @jeevireddy has joined the channel

@andrew: @andrew has joined the channel

@andrew: Hi! I’m evaluating Pinot for the following use case and want to know if it’s a good fit, or any best practices to help achieve it.

• Ingest events for ~1B total users at ~100k/second
• Run aggregation queries on events filtered on *individual user IDs* at ~10k/second, each query completing in < 100ms
What I understand is that the data is organized primarily by time (segments) and secondarily (within a segment) by indexes. In this case, I tried sorting by user ID. To query for a particular user ID, it seems that each segment must be queried, since the data is not consolidated by user. The runtime would be O(s log n) where s is the number of segments in a particular timeframe and n is the number of events per segment.

Thus, it seems that Pinot may not scale when there are tens or hundreds of thousands of segments and may not be a good fit here. However, this use case seems similar to the use cases at LinkedIn, such as the “who’s viewed your profile” feature, which also operates on events for individual users.

Is my understanding correct, and is there anything I’m missing here? Would appreciate any thoughts or resources you could point me to. Thanks!
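For concreteness, each of those per-user aggregations would be a query shaped roughly like the sketch below (the table and column names, events, userId, eventType, eventTime, are hypothetical):

```
-- One of the ~10k/second point lookups: aggregate a single user's events over a time window.
-- userId would be the sorted column; eventTime is assumed to be epoch milliseconds.
SELECT eventType, COUNT(*) AS numEvents
FROM events
WHERE userId = 123456789
  AND eventTime >= 1596844800000
  AND eventTime < 1597104000000
GROUP BY eventType
ORDER BY COUNT(*) DESC
LIMIT 10
```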
@anthonypt87: Hi Team! First post! We’re evaluating Pinot for our use case and wanted to get your thoughts on whether it’s a good fit and/or any best practices to make it happen.
The main complication we’re running into is that we may need to be able to mutate our data, which may not be a good fit for Pinot (maybe this can be avoided with some smarter data modeling or some future <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdw-2FcF2F51LKj8tTo0R9Alb88ZzqtDxq8uh1VXj5wNJDD-2FhOUp9hc9O5JFjQYNiyd72Milnf5AiFbpyEnUEOHM7RM9Lpvy8FslGClZj46bb1GZvFaFMq-2F0H29DcTre07Yg-3D-3DXn_l_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTxRmISAXL1yNbgD-2FDqVrcfbUl8j6B1msIR6iS6q8DudDUHnFZkpw1NJLxLt9M33o9Oni5eXHNWCoi2OtXOxKoVUCGIcRMQPCGyo9TF1l0J7uZa0DzjFxob4WGuGJ9vyCgE-2FCIF4REkSkboM6netD-2FKPSqqWO-2FH4owrdGfeTqcinQd49oKe02iYITx1c613zQJI-3D>?). We’re attracted to Pinot because of its ability to perform fast aggregations and to reduce engineering cost by avoiding things like pre-cubing data.

• In particular we have two streams of order data (e.g. you can imagine booking details like total price in $, an order id, account id, user name, date, etc) that are flowing into our system. 
• The two streams (let’s call them “Fast Stream” and “Accurate Stream”) of order data may overlap (i.e. the Fast Stream and the Accurate Stream may both have order info for “order 1” but Fast Stream may be the only one that has “order 2” or Accurate Stream may be the only one that has "order 3")
• Ideally we want to merge these streams together such that whenever they overlap (if they overlap), we use the data from Accurate Stream instead because it has richer user details and more accurate reporting of price.
We want to be able to quickly get time-based aggregate totals by account ID, along the lines of the sketch below. Is there a good way to model this, given that we have two data sources we want to merge?
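A minimal sketch of that target query, assuming the two streams could somehow be merged into a single table (the table and column names, orders, accountId, totalPrice, orderTime, are hypothetical):

```
-- Daily revenue per account over a time range, assuming orderTime is epoch milliseconds.
SELECT accountId,
       DATETIMECONVERT(orderTime, '1:MILLISECONDS:EPOCH', '1:DAYS:EPOCH', '1:DAYS') AS orderDay,
       SUM(totalPrice) AS totalRevenue
FROM orders
WHERE orderTime >= 1596240000000 AND orderTime < 1597104000000
GROUP BY accountId, DATETIMECONVERT(orderTime, '1:MILLISECONDS:EPOCH', '1:DAYS:EPOCH', '1:DAYS')
ORDER BY SUM(totalPrice) DESC
LIMIT 100
```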

Thanks so much for your help!

#random

@jeevireddy: @jeevireddy has joined the channel

@andrew: @andrew has joined the channel

#troubleshooting

@elon.azoulay: Yep, under heavy cpu load. There were no gc issues, just latency on the liveness check.

@quietgolfer: If I try to fix historic data with Pinot, at what point would the new data start serving? E.g. after the whole batch ingestion job completes? Or is it after each segment gets uploaded?

I'm curious if I can get into a spot where inconsistent data is served if I use a single batch ingestion job. Does it matter depending on the ingestion route (standalone, Spark, Hive)?

@g.kishore: <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdpzMyfof-2FnbcthTx3PKzMZIvTvz0ZlGzjfnWuiLO3kB-2FQ-3D-3Daq4B_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTxRmISAXL1yNbgD-2FDqVrcfbUl4lbQLwg7Bjt0qad9fw-2BGHAd5aZIcG05wXfr1HfkPx8KldVIRDDd21XV4RyZfuzl8y4pnOyU4KxZyFTaggd2sWu-2BO0s6wwWrA90DWoEgV6PS8oBZ1GOYPxR-2B8hZw10HtGlTyIFnvt8WQYTQHWrLFX0IOWN3NbPAUcu15ag01x4-3D>

@g.kishore: any time a segment is pushed, the maxTime is re-calculated

@g.kishore: then the broker rewrites the query into two different queries

@g.kishore: ```select sum(metric) from table_REALTIME where date >= time_boundary
select sum(metric) from table_OFFLINE where date < time_boundary```

@quietgolfer: Ah sorry. Let's say I already have offline data for a previous date and I onboard a new customer who wants to backfill for that date range. Without modifying the segment name structure in Pinot, what happens if I run a data ingestion job for that date?

@npawar: `E.g. after the whole batch ingestion job completes? Or is it after each segment gets uploaded?` - you will see new data after each segment gets uploaded.

@quietgolfer: Cool. I'm assuming this can cause temporary removal or duplicate metric data.

@npawar: Yes. The only way to avoid that for now is to push a single segment, but that may not always be practical.

@g.kishore: another option is to partition the data based on customerId so that the impact is minimal

@g.kishore: you can go as far as creating a segment per day per customer, but that will result in a big skew

@quietgolfer: Yea. I figured that's something we could do longer term. I was also curious about having other virtual tables that reference most of the same segments but can be used as a transition.

@pradeepgv42: Hi, when we run a query individually we see latencies on the order of 3 seconds, but when we run the same query in parallel, for example, we see latencies on the order of 7-10 seconds for the third query.
Do the queries run serially? I have one broker and 2 servers in my setup; also, would adding more servers help parallelize better and optimize query latencies?

@pradeepgv42: QQ, is there a way to optimize/improve queries of the following format (basically time-series queries)?
```SELECT DATETIMECONVERT(timestampMillis, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:HOURS'), count(*) as count_0 
FROM table 
WHERE timestampMillis < 1597104168752 and <some filters>
GROUP BY DATETIMECONVERT(timestampMillis, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:HOURS') 
ORDER BY count(*) desc
LIMIT 100```
numEntriesScannedInFilter: 18325029
numEntriesScannedPostFilter: 10665158

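One possible alternative, sketched below under the assumption that an hour-granularity column (hypothetically named hoursSinceEpoch) could be derived from timestampMillis at ingestion time, is to group on that plain column so DATETIMECONVERT does not have to run on every row at query time:

```
-- Sketch only: hoursSinceEpoch is a hypothetical ingestion-time derived column,
-- e.g. timestampMillis / 3600000, so the GROUP BY runs on a plain column.
SELECT hoursSinceEpoch, count(*) as count_0
FROM table
WHERE timestampMillis < 1597104168752 and <some filters>
GROUP BY hoursSinceEpoch
ORDER BY count(*) desc
LIMIT 100
```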
I guess caching on the client side is the simple way for us to decrease the latency; wondering if there are any alternatives.

#release-certifier

@ssubrama: We can always build it a little at a time and get the functionality we desire as we go along.

@yupeng: @yupeng has joined the channel