Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/11/10 02:00:22 UTC

Apache Pinot Daily Email Digest (2021-11-09)

### _#general_

  
 **@akshay13jain:** @akshay13jain has joined the channel  
 **@zaid.mohemmad:** @zaid.mohemmad has joined the channel  
 **@karinwolok1:** Meetup tomorrow (intro level), if anyone wants to join.
Feel free to invite friends as well if you think they would benefit from
Apache Pinot. :heart:  
**@karinwolok1:** :speaker: Reminder - this conference has an open CFP
(looking for speakers). The CFP deadline is this week. Be a conference speaker
and share your story about Apache Pinot!
**@dino.occhialini:** @dino.occhialini has joined the channel  
 **@scott.cohen:** @scott.cohen has joined the channel  
 **@aaron.weiss:** @aaron.weiss has joined the channel  

###  _#random_

  
 **@akshay13jain:** @akshay13jain has joined the channel  
 **@zaid.mohemmad:** @zaid.mohemmad has joined the channel  
 **@dino.occhialini:** @dino.occhialini has joined the channel  
 **@scott.cohen:** @scott.cohen has joined the channel  
 **@aaron.weiss:** @aaron.weiss has joined the channel  

###  _#troubleshooting_

  
 **@nair.a:** Hey team, I have a few questions, can someone help?
1) Queries are not returning results most of the time. Checking the broker logs, I found the following:
```
Failed to find servers hosting segment: mytable_0_8_20211029T2056Z for table: mytable_REALTIME
(all ONLINE/CONSUMING instances: [] are disabled, but find enabled OFFLINE instance: Server_ip_8098
from OFFLINE instances: [Server_ip_8098], not counting the segment as unavailable)
```
Is this a query timeout case?
2) I have set flush.threshold.size to 10 million, but segments are getting created with fewer rows (total docs: 3.4 million). Is this expected?
3) What type of index is recommended on a realtime table with upsert mode on?
4) In upsert mode, is there any limitation on the "comparison time column", i.e. timestamp format or granularity? My table date column is in yyyyMMddHH format; the comparison time column will be in timestamp format yyyy-MM-dd HH:mm:ss.
```
{
  "upsertConfig": {
    "mode": "FULL",
    "comparisonColumn": "anotherTimeColumn"
  }
}
```
5) Queries are timing out at 10 seconds, even after changing the values at broker and server level. Do any other configs need to be changed? `pinot.broker.timeoutMs`, `pinot.server.query.executor.timeout`
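For question 5, besides the broker and server properties, a per-table timeout can also be set in the table config. A minimal sketch, assuming a table named `mytable` and a 30s budget (both illustrative; verify that `queryConfig.timeoutMs` is honored in your Pinot version):

```
{
  "tableName": "mytable",
  "tableType": "REALTIME",
  "queryConfig": {
    "timeoutMs": 30000
  }
}
```

The effective timeout is typically the tightest of the broker default, the server executor timeout, and this table-level override, so all of them need to allow the larger budget.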
 **@alihaydar.atil:** Hello everyone, I am using version 0.7.1 and trying to
create a hybrid table. Do I have to put `controller.task.frequencyInSeconds` in
my controller config file? The configuration reference says it is deprecated.
**@xiangfu0:** Not necessarily  
**@xiangfu0:** You just need to follow the doc to create the corresponding
real-time table and offline table
**@alihaydar.atil:** @xiangfu0 Thanks)  
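For reference, the hybrid setup discussed above is just two table configs sharing the same raw table name; a minimal sketch (the name `myTable` and the elided fields are illustrative):

```
// offline part
{ "tableName": "myTable", "tableType": "OFFLINE", ... }
// realtime part
{ "tableName": "myTable", "tableType": "REALTIME", ... }
```

Queries against `myTable` then span both parts, with the broker splitting them at the offline table's time boundary.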
 **@akshay13jain:** @akshay13jain has joined the channel  
 **@zaid.mohemmad:** @zaid.mohemmad has joined the channel  
 **@dadelcas:** Hey team, I've got an Avro schema which contains an array of
records in a child field. I want to convert this to JSON during ingestion, so
I've added a transformation for this column to my realtime table. I've
specified `$` as my complex type delimiter because I've got some Groovy
transformations that I need to apply to other columns, and it is the only
delimiter I can use to make my field names compatible with Groovy identifiers.
My config looks like:
```
...
"complexTypeConfig": { "delimiter": "$", ... },
"transformConfigs": [
  ...
  "columnName": "some_field",
  "transformFunction": "json_format(parent_field$some_field)"
  ...
],
...
```
**@dadelcas:** This is not working for me. It seems messages are dropped
because the json_format function can't be applied  
**@dadelcas:** Any pointers on how I should do this?  
**@dadelcas:** By the way, I've tried using `__` (double underscore) as my
complex type delimiter, but the complex type transformer didn't like that and
was unable to extract the values, resulting in null columns. I didn't spot
anything strange in the code, so I was wondering why this delimiter can't be
used
**@g.kishore:** i think $ sign is causing some issues  
**@g.kishore:** try escaping  
**@dadelcas:** The endpoint rejects `$` and won't let me create the table. I've
tried adding `\` in front of it and got a different error
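One workaround sketch for the delimiter clash: keep the default `.` delimiter and quote the flattened column name inside the transform expression. The column names are taken from the thread, but whether `json_format` accepts a double-quoted identifier this way is an assumption to verify, not something confirmed here:

```
{
  "complexTypeConfig": { "delimiter": "." },
  "transformConfigs": [
    {
      "columnName": "some_field",
      "transformFunction": "json_format(\"parent_field.some_field\")"
    }
  ]
}
```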
 **@vibhor.jain:** Hi team, we have a hybrid table for our analytics use case
and were using UPSERT for the REALTIME table. It was working perfectly fine in
0.8: when the minion moved data to OFFLINE, we used mergeType "dedup" and
duplicates were eliminated in the OFFLINE flow as well. When we upgraded to
0.9, UPSERT is no longer supported for hybrid tables, which is blocking our
table deployment. We understand UPSERT cannot work for the OFFLINE table, but
why is it blocked for hybrid tables? Can someone clarify if we are missing
something here?
**@mayanks:** If a row is being upserted (via RT ingestion today), but the
previous row for the pk has been moved to offline part, then the upsert won't
work.  
**@mayanks:** @walterddr, we might relax this check via a config. Reason being
there are cases where it may be OK to limit upsert to RT retention time.  
**@mayanks:** @vibhor.jain Please note though, 0.9 is not officially released
yet.  
**@walterddr:** relaxing this check for now:  
**@npawar:** In my opinion, we should not support hybrid table + upsert. This
particular case is an exception (a combination of the dedup functionality and
realtime retention), and there are still cases where it won't work (as pointed
out by Mayank). It will not work properly for the majority of use cases.
**@walterddr:** What would be the error message for the failure case Mayank
mentioned? I guess as long as the task itself errors out with a proper error
message indicating the issue, we should be OK to remove this constraint from
the validation phase.
**@walterddr:** (Or we can provide a warning log? I don't know if this would be
of much help, but it's worth at least logging it somewhere.)
**@npawar:** no error message. upsert just won’t work  
**@walterddr:** @vibhor.jain can you describe exactly what behavior you want
to achieve with this realtime to offline transfer together with upsert? I am
not 100% sure we sorted out all the scenarios  
 **@luisfernandez:** How do I know if a segment is too big ?  
**@mayanks:**  
**@luisfernandez:** :pray: thank you!  
 **@walterddr:** @walterddr has joined the channel  
 **@luisfernandez:** In the logs I'm observing:
```
2021-11-09 12:53:00 Slow query: request handler processing time: 441, send response latency: 1, total time to handle request: 442
2021-11-09 12:53:00 Processed requestId=1975257,table=etsyads_metrics_REALTIME,segments(queried/processed/matched/consuming)=46/46/46/1,schedulerWaitMs=0,reqDeserMs=0,totalExecMs=441,resSerMs=0,totalTimeMs=441,minConsumingFreshnessMs=1636480380211,broker=Broker_pinot-broker-1.pinot-broker-headless.pinot.svc.cluster.local_8099,numDocsScanned=20584,scanInFilter=0,scanPostFilter=123504,sched=fcfs,threadCpuTimeNs=0
```
I was able to then find the request id in the broker and got some more info:
```
requestId=1976569,table=ads_metrics_REALTIME,timeMs=234,docs=17731/290711208,entries=0/106386,segments(queried/processed/matched/consuming/unavailable):46/46/46/1/0,consumingFreshnessTimeMs=1636480906334,servers=1/1,groupLimitReached=false,brokerReduceTimeMs=0,exceptions=0,serverStats=(Server=SubmitDelayMs,ResponseDelayMs,ResponseSize,DeserializationTimeMs,RequestSentDelayMs);pinot-server-1_R=0,233,7479,0,-1,offlineThreadCpuTimeNs=0,realtimeThreadCpuTimeNs=0,query=SELECT product_id, SUM(click_count), SUM(impression_count), SUM(cost), SUM(order_count), SUM(revenue) FROM ads_metrics WHERE user_id = 13133627 AND serve_time BETWEEN 1633924800 AND 1636520399 GROUP BY product_id LIMIT 6000
```
Is there any way I could tell from these logs why this is slow? The only thing
I can see is `scanPostFilter=123504`, which may happen because of the group by.
I believe we currently do not have any indexes on that product_id column;
would adding one speed things up in any way?
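One reading of the log: `scanInFilter=0` suggests the filter is already cheap, while `scanPostFilter=123504` (roughly 20584 matched docs times the six aggregated columns) is where the work goes, so the usual lever for this query shape is a star-tree index pre-aggregating the SUMs rather than an inverted index on `product_id`. A hedged sketch of the fragment, with all values illustrative and the column names taken from the logged query:

```
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["user_id", "product_id"],
        "functionColumnPairs": [
          "SUM__click_count",
          "SUM__impression_count",
          "SUM__cost",
          "SUM__order_count",
          "SUM__revenue"
        ],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

Note that star-tree indexes are built on committed segments, so the consuming segment in the log would still be scanned either way.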
**@richard892:** could you get a profile from the server process while
querying it? e.g. `jcmd <pid> JFR.start duration=60s filename=server.jfr` then
copy the jfr file off the box? or if you have async-profiler installed already
that would be better  
**@luisfernandez:** can i profile with jvisualvm?  
**@richard892:** preferably not, it's super high overhead  
**@richard892:** and inaccurate  
**@richard892:** do you have a JDK with jcmd?  
**@richard892:** if you know the pid of the server process, profiling with JFR
is as easy as the command above  
**@luisfernandez:** i do have jcmd i think  
**@luisfernandez:** i did  
**@luisfernandez:** `pidof java` can i do that lol  
**@luisfernandez:** seems like there’s no `ps aux` in this machine  
**@luisfernandez:** i’m using whatever configuration is there in the helm
chart  
**@richard892:** jps should identify the server process  
**@luisfernandez:** oh yea i think we have somethin  
**@luisfernandez:** so how do i read the server.jfr  
**@richard892:** you can load it in JMC  
**@richard892:** download it from the server first  
**@luisfernandez:**  
**@luisfernandez:** is something like this what i’m supposed to see in jmc  
**@richard892:** yes it gives rule based advice  
**@richard892:** there won't be anything sensitive in the file, if possible
please send it to me privately and I'll take a look tomorrow  
**@richard892:** the best place to look is method profiling  
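The profiling steps from this thread, collected into one command sketch (the `kubectl cp` step, pod, and namespace names are assumptions based on the helm deployment mentioned above):

```
# find the Pinot server JVM's pid
jps -l

# start a 60-second JDK Flight Recorder capture (needs a JDK with jcmd)
jcmd <pid> JFR.start duration=60s filename=server.jfr

# copy the recording off the pod, then open it in JDK Mission Control (JMC)
kubectl cp <namespace>/pinot-server-1:server.jfr ./server.jfr
```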
 **@dino.occhialini:** @dino.occhialini has joined the channel  
 **@scott.cohen:** @scott.cohen has joined the channel  
 **@aaron.weiss:** @aaron.weiss has joined the channel  

###  _#pinot-dev_

  
 **@akshay13jain:** @akshay13jain has joined the channel  
 **@xiangfu0:** seems the kinesis test doesn’t work again @kharekartik  
 **@xiangfu0:** Tried upgrading the localstack version, but it's not working
this time. Maybe let's not enable it by default?
 **@kharekartik:** Yeah, we can disable the integration test for now. Can you
send me the error log?
 **@xiangfu0:** It's the same as last time: the startKinesis method doesn't move on
 **@xiangfu0:**
```
cloud.localstack.docker.exception.LocalstackDockerException: Could not start the localstack docker container.
	at cloud.localstack.Localstack.startup(Localstack.java:104)
	at org.apache.pinot.integration.tests.RealtimeKinesisIntegrationTest.startKinesis(RealtimeKinesisIntegrationTest.java:224)
	at org.apache.pinot.integration.tests.RealtimeKinesisIntegrationTest.setUp(RealtimeKinesisIntegrationTest.java:135)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:108)
	at org.testng.internal.Invoker.invokeConfigurationMethod(Invoker.java:523)
	at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:224)
	at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:146)
	at org.testng.internal.TestMethodWorker.invokeBeforeClassMethods(TestMethodWorker.java:166)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:105)
	at org.testng.TestRunner.privateRun(TestRunner.java:744)
	at org.testng.TestRunner.run(TestRunner.java:602)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:380)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:375)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:340)
	at org.testng.SuiteRunner.run(SuiteRunner.java:289)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1301)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1226)
	at org.testng.TestNG.runSuites(TestNG.java:1144)
	at org.testng.TestNG.run(TestNG.java:1115)
	at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
	at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:109)
```
**@xiangfu0:** I’ve merged the PR:  to disable KinesisTest for now. Please
rebase your PRs if you are encountering the CI failure.  

###  _#getting-started_

  
 **@akshay13jain:** @akshay13jain has joined the channel  
 **@bagi.priyank:** Does the query console only show limited results for a
query? I am wondering why I am only seeing some rows in the results for a
query like
```
SELECT col1, col2, col3, DISTINCTCOUNT(col4) AS distinct_col4
FROM table
GROUP BY col1, col2, col3
```
The star-tree index looks like
```
"starTreeIndexConfigs": [
  {
    "dimensionsSplitOrder": ["col1", "col2", "col3"],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": ["DISTINCTCOUNT__col4"],
    "maxLeafRecords": 1
  }
],
```
Can I also add `DistinctCountHLL__col4` and `DistinctCountThetaSketch__col4` to
`functionColumnPairs` and evaluate the performance of all 3 for this query?
 **@jackie.jxt:** Star-tree only supports `distinctcounthll` because its
intermediate result size is bounded
 **@jackie.jxt:** You need to add `limit` to the query, or it defaults to 10  
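To illustrate Jackie's point: without a limit the query above returns at most 10 groups, so an explicit `LIMIT` (value illustrative) is needed to see the rest:

```
SELECT col1, col2, col3, DISTINCTCOUNT(col4) AS distinct_col4
FROM table
GROUP BY col1, col2, col3
LIMIT 100000
```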
 **@bagi.priyank:** Oh no theta sketch either?  
**@npawar:** This is the list of supported functions. No theta sketch yet  
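Per Jackie's answer, the HLL pair can be added alongside the existing one; a sketch of the config from the question with that change (the theta sketch pair is left out since it is not supported):

```
"starTreeIndexConfigs": [
  {
    "dimensionsSplitOrder": ["col1", "col2", "col3"],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": [
      "DISTINCTCOUNT__col4",
      "DISTINCTCOUNTHLL__col4"
    ],
    "maxLeafRecords": 1
  }
]
```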
 **@bagi.priyank:** And thank you Jackie!  

###  _#releases_

  
 **@akshay13jain:** @akshay13jain has joined the channel  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org