Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/08/12 02:00:22 UTC

Apache Pinot Daily Email Digest (2021-08-11)

### _#general_

  
 **@ananth.durai:** @ananth.durai has joined the channel  
 **@tahirozdemir34:** @tahirozdemir34 has joined the channel  
 **@tugce.dinc:** @tugce.dinc has joined the channel  
 **@ilker.mutluer:** @ilker.mutluer has joined the channel  
 **@abdullah.velioglu:** @abdullah.velioglu has joined the channel  
 **@ahmet.lekesiz:** @ahmet.lekesiz has joined the channel  
 **@tugce.dinc12:** @tugce.dinc12 has joined the channel  
 **@hatice.ozdemir:** @hatice.ozdemir has joined the channel  
**@g.kishore:** Hello everyone, thank you for being a part of the Pinot
community, and congratulations on the graduation! Last year we started with
100 members, and today we are a vibrant community of 1,600+. There are awesome
things happening beyond the Slack channel, and we didn't have a way to
communicate important news to everyone in the community. We've created an
Apache Pinot newsletter with Pinot releases and events ONLY! We recently added
a checkbox to receive news when you join our Slack, and many of you have opted
in already. However, since the opt-in was added only recently, members who
joined earlier did not have the opportunity to sign up. We will be sending an
email to everyone in the Slack channel with the opportunity to opt in, or you
can register . If you don't, you won't receive any Apache Pinot newsletters in
the future. Congratulations again on the graduation; we can't wait to continue
growing our community!  
**@aashanand:** Congrats to the Pinot team!  
 **@anshuman.bhowmik:** Getting this while starting Pinot/ThirdEye for the
first time: `org.h2.jdbc.JdbcSQLException: Database may be already in use:
null. Possible solutions: close all other connection(s); use the server mode`.
Does anyone know how to close the connections?  
**@mayanks:** @pyne.suvodeep ^^ Also @anshuman.bhowmik, there's a separate
ThirdEye Slack community  
**@pyne.suvodeep:** Hi @anshuman.bhowmik. Can you try with MySQL 5.7? It’s
probably more stable than the H2 setup  
 **@bowenwan:** Hi, I have a few questions about DISTINCTCOUNTHLL. 1. What is
the error rate for this aggregation, given that it is described as an
"approximate distinct count"? On smaller data sizes I don't see any difference
from DISTINCTCOUNT. 2. What is the intended use case for it? 3. If I want to
use a star-tree index, DISTINCTCOUNTHLL seems to be the closest thing to
DISTINCTCOUNT. What issues could come from using DISTINCTCOUNTHLL with a
star-tree index? I do care whether the result is accurate.  
**@mayanks:** 1. You can find error rates for HLL . 2. The use case is when
you want lower latency for count-distinct queries and are OK with
approximations. 3. StarTree or standalone, HLL is an approximation algorithm;
you should check whether the error margin is within your tolerance. Also, the
error margin depends on the storage used, so you can tune that.  
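The error/storage trade-off above can be illustrated with a toy HyperLogLog in Python. This is a self-contained sketch, not Pinot's actual implementation (Pinot's `DISTINCTCOUNTHLL` accepts, if I recall correctly, an optional `log2m` argument that plays the same role as `p` below: more registers means more storage but a tighter estimate, with standard error roughly `1.04 / sqrt(2^p)`):

```python
import hashlib
import math

def hll_estimate(items, p=10):
    """Toy HyperLogLog with m = 2^p registers.

    Standard error is roughly 1.04 / sqrt(m), so doubling the
    storage tightens the estimate by about sqrt(2).
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item (md5 truncated, for illustration only)
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - p)                      # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining bits
        rank = (64 - p) - rest.bit_length() + 1  # position of first 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                 # small-range correction:
        return m * math.log(m / zeros)           # fall back to linear counting
    return raw
```

The small-range (linear counting) correction is why approximate and exact counts often agree on small data sets, matching the observation above.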
 **@zsolt:** What is the recommended way to migrate a Pinot cluster to a
different S3 bucket?  
**@mayanks:** Is this for an offline or real-time table? Also, do you have to
do this without downtime?  
 **@roberto:** Hi!! I have a question about querying the database. We are
working with time series, and we want to define different kinds of time
windows in our queries. Are windowing options implemented out of the box, or
do we need to implement them ourselves?  
**@mayanks:** Hi @roberto, are you referring to `window` functions in general?
If so, Pinot doesn't have those yet today. But if you are talking about
rollups at different time granularities, that can be done. Could you give an
example query of what you are looking for?  
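For the rollup case, the underlying operation is just truncating each timestamp to the start of its window and grouping. In Pinot this is typically expressed in SQL with a time-conversion function; the Python below is only a sketch of the bucketing arithmetic, with hypothetical millisecond timestamps:

```python
from collections import Counter

def rollup_counts(timestamps_ms, window_ms):
    """Count events per fixed time window.

    Each timestamp is truncated to the start of its window,
    mirroring a GROUP BY on a truncated time column.
    """
    counts = Counter()
    for ts in timestamps_ms:
        counts[ts - ts % window_ms] += 1
    return dict(counts)

# e.g. 1-hour windows: rollup_counts(ts_list, 3600 * 1000)
```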

###  _#random_

  
 **@ananth.durai:** @ananth.durai has joined the channel  
 **@tahirozdemir34:** @tahirozdemir34 has joined the channel  
 **@tugce.dinc:** @tugce.dinc has joined the channel  
 **@ilker.mutluer:** @ilker.mutluer has joined the channel  
 **@abdullah.velioglu:** @abdullah.velioglu has joined the channel  
 **@ahmet.lekesiz:** @ahmet.lekesiz has joined the channel  
 **@tugce.dinc12:** @tugce.dinc12 has joined the channel  
 **@hatice.ozdemir:** @hatice.ozdemir has joined the channel  

###  _#troubleshooting_

  
 **@surajkmth29:** Hi team, I understand that Pinot supports encryption of
data stored in the deep store, and that a crypter needs to be provided for
this. But I am not able to figure out where to provide the crypter details;
can anyone help?  
 **@bajpai.arpita746462:** Hi team, I have been able to deploy Apache Pinot
with the latest master from git, and I can run the default cluster with
`bin/quick-start-batch.sh`. But when I try to run my own cluster, I can start
ZooKeeper, but I get an error while running the service manager command below:
`bin/pinot-admin.sh StartServiceManager -zkAddress localhost:2181 -clusterName
pinot-quickstart -port -1 -bootstrapConfigPaths
${PINOT_DIR}/config/pinot-controller.conf ${PINOT_DIR}/config/pinot-broker.conf
${PINOT_DIR}/config/pinot-server.conf`. The error is: `Error: option
"-clusterName" cannot be used with the option(s) [-bootstrapConfigPaths,
-bootstrapServices]`. I was able to run my own cluster previously, but it
fails with the latest master. Any idea?  
**@xiangfu0:** Since you provide config files, you don’t need to set
`-clusterName pinot-quickstart` etc. on the command line; set everything in
the conf files instead  
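As a rough sketch of what the conf-file approach looks like for the controller, the cluster name and ZooKeeper address move into the bootstrapped config. The key names below are from memory and the values are examples; both should be verified against the Pinot configuration reference for the version in use:

```
# pinot-controller.conf (illustrative; verify key names against the docs)
controller.helix.cluster.name=pinot-quickstart
controller.zk.str=localhost:2181
controller.host=localhost
controller.port=9000
controller.data.dir=/tmp/pinot/controller-data
```

The broker and server conf files would carry their own role-specific settings in the same style.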
 **@ananth.durai:** @ananth.durai has joined the channel  
 **@tahirozdemir34:** @tahirozdemir34 has joined the channel  
 **@tugce.dinc:** @tugce.dinc has joined the channel  
 **@ilker.mutluer:** @ilker.mutluer has joined the channel  
 **@abdullah.velioglu:** @abdullah.velioglu has joined the channel  
 **@ahmet.lekesiz:** @ahmet.lekesiz has joined the channel  
 **@tugce.dinc12:** @tugce.dinc12 has joined the channel  
 **@hatice.ozdemir:** @hatice.ozdemir has joined the channel  
 **@kangren.chia:** i have an issue with queries not being deterministic. my
table index config looks like this:
```
"tableIndexConfig": {
  "invertedIndexColumns": ["user", "ts", "cell"],
  "sortedColumn": ["user"],
  "loadMode": "MMAP"
}
```
query looks like this:
```
select user, count(*) from events
where ts between {start_ts} and {end_ts} and cell between 500 and 550
group by user
having count(user) >= 24
limit 10000
```
logs look like this:
```Processed
requestId=51,table=events_OFFLINE,segments(queried/processed/matched/consuming)=66/66/66/-1,schedulerWaitMs=1,reqDeserMs=0,totalExecMs=388,resSerMs=0,totalTimeMs=390,minConsumingFreshnessMs=-1,broker=Broker_172.20.0.4_8099,numDocsScanned=1039299,scanInFilter=206881916,scanPostFilter=1039299,sched=fcfs
Processed
requestId=52,table=events_OFFLINE,segments(queried/processed/matched/consuming)=66/66/66/-1,schedulerWaitMs=0,reqDeserMs=1,totalExecMs=424,resSerMs=2,totalTimeMs=427,minConsumingFreshnessMs=-1,broker=Broker_172.20.0.4_8099,numDocsScanned=1039299,scanInFilter=206881916,scanPostFilter=1039299,sched=fcfs
Processed
requestId=53,table=events_OFFLINE,segments(queried/processed/matched/consuming)=66/66/66/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=407,resSerMs=0,totalTimeMs=408,minConsumingFreshnessMs=-1,broker=Broker_172.20.0.4_8099,numDocsScanned=1039299,scanInFilter=206881916,scanPostFilter=1039299,sched=fcfs
Processed
requestId=54,table=events_OFFLINE,segments(queried/processed/matched/consuming)=66/66/66/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=416,resSerMs=0,totalTimeMs=418,minConsumingFreshnessMs=-1,broker=Broker_172.20.0.4_8099,numDocsScanned=1039299,scanInFilter=206881916,scanPostFilter=1039299,sched=fcfs```  
 **@kangren.chia:** is this expected or am i doing something wrong here?  
 **@mayanks:** Sorry, which part is inconsistent?  
 **@kangren.chia:**
```
$ time python3 queries.py
found 6812 userids
$ time python3 queries.py
found 6782 userids
$ time python3 queries.py
found 6895 userids
```
**@mayanks:** Can you remove the `having` clause to check if the result
becomes consistent?  
**@kangren.chia:** ok will try  
**@kangren.chia:** yes, removing the `having` clause gives me deterministic
results  
**@kangren.chia:** the query becomes a lot slower, and i need to increase the
`limit x` of the query and do the filtering on the client side though  
**@kangren.chia:** looking at  
**@kangren.chia:** > We can also push certain having clauses to be processed
on the server side (instead of on the broker side after merging all the
results for each group) to reduce the amount of data sent from server to
broker. is this the reason for nondeterminism?  
**@mayanks:** It could be because of that, or because the servers may be
trimming results before sending them back to the broker.  
**@mayanks:** That happens in the case of low-selectivity queries.  
**@kangren.chia:** another data point: if the `cell between 500 and 550` is
reduced to a smaller range, i get deterministic results  
**@kangren.chia:** in any case for my use case i always want deterministic
results. is there a configuration or some other statement i can add to the sql
statement to ensure this?  
**@kangren.chia:** btw, i am running this with the docker-compose quickstart
on my machine, so it’s just one server/container  
**@mayanks:** try increasing the `limit` in your query  
**@kangren.chia:** i tried that and it works! although it’s kind of
non-intuitive: the final result (6k) is less than 10k, yet increasing the
limit to 100k grows the final result to a deterministic value (42k)  
**@kangren.chia:** thanks for your help @mayanks  
**@kangren.chia:** i would’ve thought that `limit` is applied after `group by
+ having`  
**@kangren.chia:** curious about the rationale for applying it before  
**@mayanks:** Yeah, the reason for trimming is in the broker log, see the
`numDocsScanned`  
**@mayanks:** Your query is selecting `1039299` rows  
**@kangren.chia:** i think i’m missing something. how did the number 1,039,299
clue you in when my limit is 10,000?  
**@kangren.chia:**
```
# with the increased limit of 100k
Processed requestId=90,table=events_OFFLINE,segments(queried/processed/matched/consuming)=198/198/198/-1,schedulerWaitMs=0,reqDeserMs=1,totalExecMs=1331,resSerMs=2,totalTimeMs=1334,minConsumingFreshnessMs=-1,broker=Broker_172.21.0.4_8099,numDocsScanned=3116947,scanInFilter=624231605,scanPostFilter=3116947,sched=fcfs
# with the previous limit of 10k
Processed requestId=91,table=events_OFFLINE,segments(queried/processed/matched/consuming)=198/198/198/-1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=1159,resSerMs=1,totalTimeMs=1160,minConsumingFreshnessMs=-1,broker=Broker_172.21.0.4_8099,numDocsScanned=3116947,scanInFilter=624231605,scanPostFilter=3116947,sched=fcfs
```
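The non-determinism discussed above can be reproduced outside Pinot. The sketch below is a simulation, not Pinot's actual execution path: it shows why trimming groups to `LIMIT` before the `HAVING` filter makes the result depend on the order groups happen to arrive in, while a `LIMIT` at least as large as the number of surviving groups is deterministic:

```python
import random
from collections import Counter

def trimmed_query(rows, limit, seed):
    """Simulate: GROUP BY user HAVING count(*) >= 24 LIMIT `limit`,
    where groups are trimmed to `limit` BEFORE the HAVING filter,
    and arrive in an arbitrary (seed-dependent) order."""
    groups = list(Counter(rows).items())       # (user, count) per group
    random.Random(seed).shuffle(groups)        # arbitrary server-side ordering
    trimmed = groups[:limit]                   # LIMIT applied first...
    return {user for user, n in trimmed if n >= 24}  # ...HAVING second
```

With 50 hypothetical users of 30 events each, any `limit >= 50` returns all 50 users regardless of ordering, but `limit = 40` returns an arbitrary 40 of them, mirroring the varying `found N userids` output above.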

###  _#pinot-s3_

  
 **@hrsripad:** @hrsripad has joined the channel  

###  _#pinot-dev_

  
 **@surajkmth29:** Hi team, I understand that Pinot supports encryption of
data stored in the deep store (HDFS), and that a crypter needs to be provided
for this. But I am not able to figure out where to provide the crypter
details; can anyone help?  
**@g.kishore:** You need to provide this configuration in the Pinot controller
and server configs, and ensure that the crypter classes are available on the
classpath  
**@surajkmth29:** Hi @g.kishore, is there a doc or wiki that explains the
process? I don't see anything related to this  
**@g.kishore:** Looks like we don’t have docs for that  
**@g.kishore:** Maybe some test cases  

###  _#community_

  
 **@hrsripad:** @hrsripad has joined the channel  

###  _#metadata-push-api_

  
 **@hrsripad:** @hrsripad has joined the channel  

###  _#getting-started_

  
 **@orajason:** @orajason has joined the channel  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org