Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2020/08/21 02:00:07 UTC

Apache Pinot Daily Email Digest (2020-08-20)

### #general

@katneniravikiran: Pinot is taking a long time to import data when the data size is huge. I am using the "standalone" data load job, trying with 80GB of TPCH Lineitem data split into 600 files (each file is around 130MB). Creating the segment files takes around 3 hours on a 4-CPU, 64GB-RAM machine. Is this expected behavior?
@mayanks: How many controllers do you have? Are you pushing files sequentially?
@mayanks: Using deep-store with segment-uri push will help reduce the time by avoiding having to push the actual payload.
@katneniravikiran: Two controllers.
@katneniravikiran: Can you help me find the documentation for "deep-store with segment-uri push"?
@mayanks: Here's a sample: <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdo5Vml3IIBLs7JiV3aA7xH6E1xuyunoVEcnqa9dbWn9CIJQSKp-2BZWo-2BWye-2FipnOZiQLReT6c2Mzg50KdMUPKcFdVOMaIzyscgjS7oqxfmp915p3jVDV-2FG594Eqj0lJbWEW6V77mINHj3FYdWSgpfbXYrwXW_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTylyTc9khVGqOPJLveBssI2aIj6ojn6scFUopLefrXk7X8JWqO8jPV0rvh2IlKpNnSm-2FbwDtyAuFXi-2Bx2i6pHbTGZf4sbtR6kGGxtojcVwKFf9eNnMRQR7BvR5YsCJpjSpGIbMrppIPX8rGoPW3Yt6p0PMEhU5GepOuvZDGv5o2kDcxPVocWlViieqE6Y-2FqGWw-3D>
@katneniravikiran: Is there an option other than using HDFS?
@mayanks: Yeah, you can also use GCS or S3.
@katneniravikiran: Ok, thanks.
@mayanks: May I ask what you are trying to achieve? Is this a benchmark?
@mayanks: Actually, it looks like even when using deep-store, the controller may still need to download the segments (metadata push may not be supported yet).
@katneniravikiran: Yes, we are trying to benchmark the Presto-Pinot combination for 10GB to 200GB of TPCH data. We want good query performance along with fast loading. Compared with other OLAP databases, Pinot seems to take a long time to load data. One observation is that the standalone job uses a single CPU (not sure how many threads) for a single upload job, even when there are multiple files in the import folder; other OLAP databases seem to load data using more than one CPU. Is there any setting to make the Pinot import job use more than one CPU? Using HDFS, S3, or GCS is not in the scope of the benchmark; we want to minimize the dependency on Hadoop or other big systems because the data sizes we are targeting are not truly big-data territory.
@mayanks: Ok, then we need to understand where the time is spent. Is it on index generation or the actual push?
@katneniravikiran: Indexing is taking time. Push is relatively faster.
@mayanks: Hmm, then it should be easy to make the job multi-threaded, if it isn't already.
@g.kishore: There is a parallelism setting (see the job-spec sketch at the end of this digest).

### #feat-presto-connector

@christian: @christian has joined the channel

### #troubleshooting

@yash.agarwal: I am using Presto for querying and joining results from Pinot. What is the recommended approach to doing multiple aggregations like the following in a single query?
```
select channel,
    sales_date,
    sum(sales) as sum_sales,
    sum(units) as sum_units
from pinot.default.sales
group by channel, sales_date
```
Currently Presto is trying to fetch raw values for all the columns.
@yash.agarwal: Also, how can I use custom Pinot UDFs like segmentPartitionedDistinctCount in Presto queries? (See the pass-through query sketch at the end of this digest.)
@g.kishore: @yash.agarwal yes. @fx19880617, let's enable allow-multiple-aggregations by default in the Presto Pinot connector.
@fx19880617: Will do.
@mailtobuchi: Hey everyone, `DISTINCTCOUNT` queries on raw data from realtime tables seem to be very slow. We tried the HLL approximation but that didn't help. If we were okay with approximate results, would you recommend Theta Sketches? Are those generally faster than HLL?
@mayanks: HLL is faster than Theta Sketches (T/S).
@mayanks: T/S is better if you want to do set operations like intersect/union/difference.
@mayanks: HLL or T/S helps if you pre-aggregate, which you can't do for realtime tables.
@mailtobuchi: Hmm, for most of our queries both `DISTINCTCOUNT` and `HLL` are equally slow. Are there any optimizations we can do to improve the latencies? (See the distinct-count query sketch at the end of this digest.)
@mayanks: What's the numDocsScanned?
@mayanks: A good feature ask: aggregating HLL or T/S derived columns during consumption (cc: @jackie.jxt).
@jackie.jxt: We should support all the aggregations available in `ValueAggregator` for aggregation during consumption.
@jackie.jxt: FYI, those are the aggregations supported by star-tree.

### #pinot-0-5-0-release

@tingchen: @fx19880617 can you take a look at <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMSfW2QiSG4bkQpnpkSL7FiK3MHb8libOHmhAW89nP5XK4rP-2BkFe5YEFfdRMoaAM6kg-3D-3DosIx_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTylyTc9khVGqOPJLveBssI2gTFYpV9Y9Za9IhPmwUCSGxYYjYF1ZdZvjwwWiSIkDyBRsYDIrn6BgbGi-2B8PYNgOppkWVDt7qHyE6Yo-2FEu3ElEaI7OROfz6-2FeWqfn6ng5BQSnjmXeJODYM1jSqye-2F7ghIYTcJCR93RJKyA9C4gRhaMHFbJNluLdWoxXaA-2BoWg6a0-3D> for the license and notification file changes?
@tingchen: Thanks.
@fx19880617: Sure.

### #lp-pinot-poc

@andrew: @here Thanks for everyone's help so far. I invited you all to a GitHub project where I've put the cluster setup. Let me know if you are able to provide further assistance and if you have any questions.
@g.kishore: Thanks Andrew, we will try it this week and get back to you.
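
Sketch referenced from #general above: a standalone batch ingestion job spec that builds segments with multiple threads and pushes only segment URIs from a deep store. This is a hedged sketch rather than the exact setup discussed in the thread: the paths, bucket, table name, and CSV format are hypothetical, and field names such as `segmentCreationJobParallelism` and the `SegmentCreationAndUriPush` job type should be verified against the `SegmentGenerationJobSpec` of your Pinot version.

```
# Hedged sketch of a standalone batch ingestion job spec; verify field names
# against your Pinot version. Paths, bucket, and table name are hypothetical.
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
# Build segments locally, then push only their URIs (assumes the controller and
# servers can read the deep-store output location below).
jobType: SegmentCreationAndUriPush
inputDirURI: 'file:///data/tpch/lineitem/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://my-bucket/pinot-segments/lineitem/'
overwriteOutput: true
# Number of segment-creation threads used by the standalone runner (the
# "parallelism setting" mentioned in the thread), if supported by your version.
segmentCreationJobParallelism: 4
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'lineitem'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```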
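On the #troubleshooting question about using Pinot-specific functions such as segmentPartitionedDistinctCount from Presto: one commonly used route, assuming the Presto Pinot connector build in use supports pass-through ("dynamic table") queries, is to quote an entire Pinot query in place of the table name so it is executed by the Pinot broker rather than translated by Presto. The `customer_id` column and the explicit LIMIT are assumptions for illustration.

```
-- Hedged sketch: the quoted query is handed to the Pinot broker as-is, so
-- Pinot-only aggregations like segmentPartitionedDistinctCount can be used.
-- Verify that your connector version supports this dynamic-table syntax.
-- An explicit LIMIT is included because Pinot applies a small default limit
-- to group-by results.
SELECT *
FROM pinot.default."SELECT channel, sales_date, segmentPartitionedDistinctCount(customer_id) FROM sales GROUP BY channel, sales_date LIMIT 100000";
```

For plain SQL aggregations like the sum example above, having the connector push the aggregation down (the allow-multiple-aggregations behavior discussed in the thread) avoids Presto fetching raw values for all columns.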
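On the slow `DISTINCTCOUNT` thread: the three variants being compared can be tried directly against the Pinot broker to see how latency and numDocsScanned differ. A sketch, assuming a hypothetical realtime table `events` with a `user_id` column; exactness decreases and speed/mergeability generally improve going down the list.

```
-- Exact distinct count: keeps every distinct value, typically the slowest.
SELECT DISTINCTCOUNT(user_id) FROM events

-- HyperLogLog approximation: fixed-size sketch, usually the fastest approximate option.
SELECT DISTINCTCOUNTHLL(user_id) FROM events

-- Theta sketch approximation: generally slower than HLL, but the sketches support
-- set operations (intersection/union/difference), as noted above.
SELECT DISTINCTCOUNTTHETASKETCH(user_id) FROM events
```

As noted in the thread, none of these are pre-aggregated on a realtime table, so numDocsScanned (and therefore latency) still grows with the number of consumed rows.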