Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/09/11 02:00:17 UTC

Apache Pinot Daily Email Digest (2021-09-10)

### _#general_

  
 **@alec:** @alec has joined the channel  
 **@pascal.jerome:** @pascal.jerome has joined the channel  
 **@karinwolok1:** :mortar_board: I don't know about all of you, but I'm still
riding the Apache Pinot graduation wave :heart_eyes: :wine_glass:
:mortar_board: Just wanted to take a moment to shout out our amazing Apache
Pinot contributors. Thank you for all that you do and for being such a big
part of this movement!!! :heart: As Pinot continues to grow, you'll be
touching the lives of many (developers and end-users everywhere :smiley: )
Here's a blog post by my colleague @allison that highlights many (definitely
not all) of these awesome folks who helped bring this project to where it is
today  @jackie.jxt @xiangfu0 @npawar @g.kishore @mayanks @ssubrama @slack1
@snlee @jlli @steotia @yupeng @miliang @kharekartik @tingchen @sanket
@chinmay.cerebro @amrish.k.lal @hbwang89 @moradi.sajjad @jiapengtao0 @shgandhi
@elon.azoulay @jamesyf.shao @chethanu.tech @guo.980 @yash.agarwal @pradeepgv42
@ken @joey @vananth22 @s.aditya.1811 @atri.sharma For the full list of
contributors, check out the Apache Pinot github (and give it a :star: if you
haven't already!)  
**@ebuth:** @ebuth has joined the channel  

###  _#random_

  
 **@alec:** @alec has joined the channel  
 **@pascal.jerome:** @pascal.jerome has joined the channel  
 **@ebuth:** @ebuth has joined the channel  

###  _#troubleshooting_

  
 **@kangren.chia:** has anyone seen the client become the bottleneck for their
queries before? i’m seeing `timeMs=~1000ms` on my broker, but it takes 2-3
seconds on the python/java clients for the single line that executes the
query, e.g.
```
# python
curs.execute(sql_query)

// java
ResultSetGroup pinotResultSetGroup = pinotConnection.execute(pinotClientRequest);
```  
**@mayanks:** What's the size of the response?  
**@kangren.chia:** this is for a query that fetches 400k ints  
**@kangren.chia:** is the response size in the broker logs?  
**@kangren.chia:** oh yeah it is, `ResponseSize=6000433`  
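As a rough sanity check (mine, not from the thread): 400k ints rendered as JSON text carry several bytes of framing per value, so a ~6 MB response for a 400k-int result is plausible. A minimal sketch, with a made-up `"rows"` wrapper rather than Pinot's actual response shape:

```python
# Compare the JSON-text size of 400k small ints against the same data
# as raw 32-bit values. The payload shape here is illustrative only.
import json

rows = [[i] for i in range(400_000)]        # 400k single-int rows
payload = json.dumps({"rows": rows})
raw_bytes = 400_000 * 4                     # same data as raw int32: 1.6 MB

print(len(payload))                         # several MB of JSON text
print(raw_bytes)
```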
**@mayanks:** I am guessing there's some JSON rendering/conversion happening
that might be taking time cc: @jackie.jxt  
**@kangren.chia:** I can’t use idset and in_subquery directly to have all the
work happen in the cluster so I need to split it into 2 queries on the client
side  
**@kangren.chia:** Would be nice if idset could work with queries that have a
GROUP BY + HAVING clause  
**@kangren.chia:** for the python client, the slowness definitely comes from
the client:
```
def execute_query(query):
    start = timer()
    curs.execute(query)
    end = timer()
    print(f"query took {end-start}")
    return curs.fetchall()
```
```
python3 -m cProfile query.py

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    208/1    0.015    0.000   48.651   48.651  {built-in method builtins.exec}
 811465/5    0.907    0.000   47.301    9.460  db.py:41(g)
 405731/2    0.519    0.000   42.400   21.200  db.py:52(g)
        2    0.368    0.184   42.400   21.200  db.py:369(fetchall)
   405729    0.396    0.000   41.531    0.000  db.py:393(__next__)
   405729    0.348    0.000   40.211    0.000  db.py:345(fetchone)
   405731   39.863    0.000   39.863    0.000  {method 'pop' of 'list' objects}
        2    0.000    0.000    4.900    2.450  db.py:263(execute)
```
@xiangfu0  
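Worth noting: the profile above charges ~40 s to `{method 'pop' of 'list' objects}`. If a driver pops rows off the *front* of a plain Python list, each `pop(0)` is O(n), so draining n rows is O(n²); a `deque` drains in O(n). This is a standalone illustration of that effect, not the actual pinotdb internals:

```python
# Drain 30k items from the front of a list vs. a deque.
import time
from collections import deque

n = 30_000
rows = list(range(n))

start = time.perf_counter()
lst = rows.copy()
while lst:
    lst.pop(0)          # shifts every remaining element: O(n) per call
list_secs = time.perf_counter() - start

start = time.perf_counter()
dq = deque(rows)
while dq:
    dq.popleft()        # O(1) per call
deque_secs = time.perf_counter() - start

print(f"list.pop(0): {list_secs:.3f}s  deque.popleft(): {deque_secs:.3f}s")
```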
**@xiangfu0:** hmm, it may come from the results JSON parsing  
**@kangren.chia:** does broker return JSON only? is there a more efficient way
to retrieve data?  
**@mayanks:** The broker currently returns JSON, and I have also seen JSON
parsing being slow for really large response sizes  
**@ken:** We also noticed JSON parsing being the bottleneck for large
responses. It would be useful to have a binary/serialized response format,
similar to how Solr supports either xml or binary.  
**@ken:** Though this definitely can create compatibility issues, when you
want to extend the response format. At the time this was implemented for Solr,
Avro didn’t exist but would have been a better option than their own custom
format.  
**@mayanks:** Yeah, it should be possible to send back serialized response
that client side library can provide abstractions to iterate on etc. And yeah,
I can see how it opens up the door for compatibility issues. Historically
speaking though, the Pinot response has not changed that frequently  
 **@mrpringle:** Do we have any way to control skipUpsert when using presto on
top of pinot?  
 **@alec:** @alec has joined the channel  
 **@pascal.jerome:** @pascal.jerome has joined the channel  
 **@luisfernandez:** hey friends, we are trying to set up pre-aggregation for
one of the realtime tables in our cluster, for this schema:
```
{
  "schemaName": "etsyads_metrics",
  "dimensionFieldSpecs": [
    { "name": "shop_id", "dataType": "LONG" },
    { "name": "listing_id", "dataType": "LONG" }
  ],
  "metricFieldSpecs": [
    { "name": "cost", "dataType": "INT", "defaultNullValue": 0 },
    { "name": "impression_count", "dataType": "INT", "defaultNullValue": 0 },
    { "name": "click_count", "dataType": "INT", "defaultNullValue": 0 },
    { "name": "order_count", "dataType": "INT", "defaultNullValue": 0 },
    { "name": "revenue", "dataType": "INT", "defaultNullValue": 0 }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "serve_time",
      "dataType": "LONG",
      "format": "1:SECONDS:EPOCH",
      "granularity": "1:HOURS"
    }
  ]
}
```
we were reading this  and we set our metric columns as noDictionaryColumns:
```
"noDictionaryColumns": [
  "click_count",
  "order_count",
  "impression_count",
  "cost",
  "revenue"
],
"aggregateMetrics": true
```
but we still see multiple rows for the same dimension values in the same
hour, does anyone have some guidance as to why this may be happening?  
**@jackie.jxt:** Is this a realtime table?  
**@luisfernandez:** that is right  
**@jackie.jxt:** Within each consuming segment there will be unique dimension
values  
**@jackie.jxt:** If you have multiple stream partitions, then there will be
multiple consuming segments  
**@jackie.jxt:** Also when the previous consuming segment gets flushed, the
next one will have the same dimension values again  
**@ebuth:** After we truncated our time values to the hour (before sending
them to pinot), we did see aggregated values as expected. It almost seemed
like pinot was including our time value as a dimension for the purposes of
aggregation.  
**@ebuth:** And ignoring the “granularity” setting.  
**@jackie.jxt:** The `granularity` field within the `dateTimeFieldSpecs` is
just for descriptive purposes as of now; Pinot won't automatically truncate
the time value  
**@jackie.jxt:** You can use the ingestion transform to truncate the time
value  
 **@ebuth:** @ebuth has joined the channel  
 **@dadelcas:** Hey, I'm having an issue with hybrid tables. It looks like
RealtimeToOfflineSegmentsTask was not moving all the records in the realtime
segment, so for example I've got 4 million records in the realtime table but I
end up with only 600 in the offline table after the realtime segments are
deleted. Surely this is a configuration issue in my table. I was wondering
whether this could be a problem between the table retention periods and the
offline flow buffer configuration.  
**@dadelcas:** In fact the realtime segments in S3 are a few MB each, whereas
the offline segments are only a few KB  
**@npawar:** can you share both table configs?  
**@npawar:** the realtime to offline job will move data from realtime to
offline, periodically, 1 window at a time. So if your realtime table has
several time windows, it can take several runs before all data is moved over.
If the realtime table’s retention kicks in before that, it will get deleted.
The 2 processes aren’t aware of each other  
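For context, the window size and buffer @npawar refers to live in the realtime table's task config, and retention in `segmentsConfig`. A hedged sketch with illustrative values (the keys are the task's standard ones; the values are not from this thread, and retention must be comfortably longer than it takes the periodic task to move every window):

```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "2d"
    }
  }
},
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "14"
}
```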

###  _#getting-started_

  
 **@sina.tamizi:** Hi team, I am working on a project for realtime speed test
calculation. I get the speed test data from devices via Kafka ingestion. Once
they are in Pinot, the following calculations need to be performed: \- peak-
hour data (7pm-11pm) needs to be selected. \- data comes in at different
timestamps; the average speed needs to be calculated for every hour between
7pm and 11pm each day, e.g. 7-8pm average, 8-9 average, 9-10 average and
10-11pm average (the average for each hour should be available as soon as that
1-hour window is complete). \- the 4 averages need to be stored in another
table, giving 4 sample data points per day. \- from that second table, the
past 14 days of data need to be selected, and the 3rd-worst speed should be
reported and stored in another table. Both of these tables would be my
reports. The question is whether Pinot is a suitable platform for these sorts
of calculations. What would be the best way to run ETL jobs or tasks to run
the queries that do the calculations? I have already done this with InfluxDB,
but I would like to design/implement it with Pinot. Note that I also have
other use cases with the same data where the data needs to be reported in
realtime. Thank you in advance for your help.  
**@g.kishore:** You can achieve this with Pinot, but the ETL task needs to be
scheduled and maintained outside of Pinot. If you have Kafka, create a topic
for each table and schedule the tasks using a scheduler like Airflow. In each
task, query Pinot and write the output to another Kafka topic.  
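A minimal sketch of one such scheduled task: compute the just-closed hourly window and build the averaging query to run against Pinot. The table and column names (`speed_tests`, `speed_mbps`, `serve_time`) are invented, and the Airflow/Kafka wiring is omitted:

```python
def hour_bucket(epoch_seconds):
    """Floor an epoch-seconds timestamp to the start of its hour."""
    return epoch_seconds - (epoch_seconds % 3600)

def build_hourly_avg_query(now_epoch_seconds):
    """Average speed over the most recently completed 1-hour window."""
    end = hour_bucket(now_epoch_seconds)
    start = end - 3600
    return (
        "SELECT AVG(speed_mbps) FROM speed_tests "
        f"WHERE serve_time >= {start} AND serve_time < {end}"
    )

# Scheduled hourly: run the query against Pinot, then publish the result
# to whatever feeds the second (daily-samples) table.
query = build_hourly_avg_query(1631304123)
```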
**@sina.tamizi:** @g.kishore thanks for your help. Unfortunately, the Kafka
server is in another environment and I simply consume the data from it. I am a
bit hesitant to introduce a new Kafka cluster just for writing data from one
table to another. Is there any other way? It would be great if Pinot could
support INSERT INTO  
**@g.kishore:** we don't have that today; you can do it via Presto/Trino cc
@elon.azoulay  
 **@elon.azoulay:** @elon.azoulay has joined the channel  
 **@alec:** @alec has joined the channel  
 **@tiger:** Does Pinot automatically delete the generated indices from the
servers after deleting a segment? I'm running into an issue where I delete a
segment through the REST API, but it leaves the index files under
PinotServer/index behind. The indices have built up over time from my test
tables and now the servers are out of disk space.  
**@jackie.jxt:** The indices should be in the same folder as the segment, and
removed along with the segment. Can you please check what files are left over
on the servers?  
**@tiger:** It seems that all the segments are still remaining in
PinotServer/index. I think my cluster somehow ended up in a weird state where
it shows something like 1000/0 segments in the controller UI for that table.  
**@tiger:** It doesn't show the segments anymore through the UI but still
returns the segments when viewing them through the REST API. Trying to delete
through the REST API again seems to do nothing now.  
**@jackie.jxt:** Can you please check the idealstate and external view of the
table?  
**@tiger:** The ideal state is empty. The external view actually shows all of
the segments as OFFLINE  
**@jackie.jxt:** The external view still exists?  
**@jackie.jxt:** Can you check if there is any ERROR in the server log?  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org