Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/02/28 02:00:21 UTC

Apache Pinot Daily Email Digest (2022-02-27)

### _#general_

  
 **@pepp:** @pepp has joined the channel  

### _#random_

  
 **@pepp:** @pepp has joined the channel  

### _#troubleshooting_

  
 **@pepp:** @pepp has joined the channel  
 **@diogo.baeder:** Hi guys! I know that Pinot supports ScalarFunction, but is
it possible to add custom aggregation functions somehow? How can I aggregate
data with logic where rows are inter-dependent?  
**@diogo.baeder:** The reason I'm asking is that I have a use case where I
need to aggregate users' requests into buckets that have a minimum distance of
30 minutes between each other (it's how we define "user sessions" in our
project). I'm currently doing this in Python, and it involves a lot of I/O
from Pinot to my application, so I was wondering whether it would be possible
to push this logic down and have the aggregation done in the Pinot cluster
itself.  
**@mayanks:** See if you can use groovy functions for that  
**@mayanks:** What is the response data size that you are taking out from
Pinot? 40s seems unusually large for a 500ms query  
**@diogo.baeder:** I was able to bring it down to ~12s, of which ~10s is just
for transferring from Pinot into my application (AWS EU datacenter into
on-premises US datacenter, so it's an understandable bottleneck). I haven't
checked the response data size yet, but I'll check; I bet it's quite a few MB.
How is Groovy able to cope with aggregating rows, though? I understand how it
can process one or more columns in a row, but I don't understand how it can
work with multiple rows in the same call.  
**@mayanks:** Yeah, maybe not groovy. Will grouping by 30-min buckets help? If
so, there's the datetime conversion UDF that can be used. If not, can you
elaborate on the requirement a bit?  
**@diogo.baeder:** The problem is that those would be buckets fixed in time,
so, for example, a request done at 14:29:59 would be in a different session
than another one made by the same user at 14:30:01. The idea is that, as soon
as a new request is iterated over that has more than 30 mins of difference
from the last request, it would be aggregated as a different session. Let me
elaborate a bit better. Imagine these requests by the same user, done on the
same day, at times:

1. 14:29:59
2. 14:30:01
3. 14:45:00
4. 15:20:00
**@diogo.baeder:** What we want is to group the first 3 requests into the
same session, containing 3 visits, while the 4th request would be a second
session with only 1 visit.  
**@diogo.baeder:** Another scenario: imagine the same user making 100
requests, one every minute. We want to group all those 100 requests as a
single session, containing 100 visits, no matter how much time in total the
user has been in that session.  
**@diogo.baeder:** The breaking point, for us, is the time between consecutive
requests from the user: if it's more than 30 minutes, we break into a
different session.  
**@mayanks:** Yeah, seems like a window function, which Pinot doesn't have
right now. In the absence of that, the only options I can think of are
upstream processing to mark sessions, or client-side processing.  
**@mayanks:** For the client side, why is the response size huge? What's the
query, and can you process batches in parallel?  
**@diogo.baeder:** The response size is so huge because of the number of
requests we get. But, honestly, I need to pin down how much data is being
transferred. And we have a plan to migrate our app into AWS@EU too, so the
latency would be dramatically reduced, which would cut down a lot of that
time. Processing in parallel would be a bit too complicated to implement,
because we would have to rely on something other than Python (because of the
GIL, multithreading is not efficient at all when it comes to parallel
computations in Python), but that's not the biggest issue we have; the biggest
one is I/O.  
**@mayanks:** By parallel I meant that if there are batches of queries that
are non-overlapping, you can simply fire multiple queries in parallel and just
concat the results. That would not require complex MT code in Python. However,
if the plan is to migrate the app to the same AWS region, that would be ideal.  
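
(For what it's worth, that fan-out doesn't need complex threading: the work is
network I/O, which releases the GIL, so a plain thread pool suffices. A sketch
using the `pinotdb` DB-API client; the `requests` table, its columns, and the
broker host/port are hypothetical, and `"timestamp"` is assumed to hold epoch
millis:)

```python
from concurrent.futures import ThreadPoolExecutor

from pinotdb import connect  # Python DB-API client for Pinot

# Hypothetical table/columns; adjust broker host/port for your cluster.
QUERY = """
    SELECT userId, ToEpochSecondsRounded("timestamp", 600) AS bucket,
           COUNT(*) AS visits
    FROM requests
    WHERE "timestamp" >= %d AND "timestamp" < %d
    GROUP BY userId, ToEpochSecondsRounded("timestamp", 600)
    LIMIT 1000000
"""

def run_query(time_range):
    start_ms, end_ms = time_range
    conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
    cursor = conn.cursor()
    cursor.execute(QUERY % (start_ms, end_ms))
    return cursor.fetchall()

# Six non-overlapping one-hour ranges (epoch millis), fired in parallel;
# the per-batch results are simply concatenated at the end.
HOUR_MS = 3600 * 1000
start = 1645920000000  # 2022-02-27 00:00:00 UTC
ranges = [(start + i * HOUR_MS, start + (i + 1) * HOUR_MS) for i in range(6)]
with ThreadPoolExecutor(max_workers=6) as pool:
    rows = [row for batch in pool.map(run_query, ranges) for row in batch]
```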
**@diogo.baeder:** Got it. Yeah, we hope that can happen soon. For now,
though, it's not a huge issue, just something I want to optimize for greater
UX. I mean, we now have a use case that responds in ~12s, whereas in our old
system that used to take at least 10 minutes! So not much to complain about
there... :smile:  
**@diogo.baeder:** Man, your idea about using buckets is actually not bad! I
already do that to put them in a minute-resolution bucket aggregation, so I
was thinking putting them in 10-minute buckets would be fine too, because
there's just a slight chance I end up getting visits in the same session when
they should actually be in different sessions.  
**@diogo.baeder:** Actually, there's a slight difference in the numbers we
get, but it might not be a problem.  
**@mayanks:** Sounds good  
**@diogo.baeder:** When I do that and use `ToEpochSecondsRounded("timestamp",
600)` instead of `ToEpochSecondsRounded("timestamp", 60)`, I'm able to bring
the time to load the data into my app from ~10s to ~6s, and post-processing
from ~4s to ~0.8s, which is quite an improvement.  
**@diogo.baeder:** Ah, by the way, we still use Python's `datetime` instances
in our app, but now that I'm getting seconds back from Pinot instead of
date-time strings, I'm converting them directly from epoch seconds, which is
way faster (as would be expected in most languages, since processing numbers
is generally faster than parsing strings).  
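
(For example; the timestamp value here is illustrative:)

```python
from datetime import datetime, timezone

epoch_seconds = 1645971000  # as returned by Pinot

# Direct conversion from epoch seconds: one arithmetic step, no parsing.
fast = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# The old path: parsing a formatted date-time string, which is far slower.
slow = datetime.strptime("2022-02-27 14:10:00", "%Y-%m-%d %H:%M:%S")
```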