Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/07/01 02:00:19 UTC

Apache Pinot Daily Email Digest (2021-06-30)

### _#general_

  
 **@sajjan:** Hey Karin, I was looking for near real-time reporting tech for
our system and Pinot seems great. We may need other tools along with Pinot,
so it's good to be here, guys.  
**@mayanks:** Great to have you, feel free to ask questions as you try out
Pinot  
 **@nicolas.richard:** @nicolas.richard has joined the channel  
 **@frederic.vanharen:** @frederic.vanharen has joined the channel  

###  _#random_

  
 **@nicolas.richard:** @nicolas.richard has joined the channel  
 **@frederic.vanharen:** @frederic.vanharen has joined the channel  

###  _#troubleshooting_

  
 **@jmeyer:** Hello, I'm seeing odd behavior with filtering (WHERE predicate).
A column contains LONG values (single-valued), and filtering only works for
some of the values in that column. Here's the data:  
**@jmeyer:** Here's a query that works:  
**@mayanks:** IIRC, there was a UI issue that was causing this. Do you see the
same issue when making curl calls to Pinot from shell?  
**@mayanks:** The UI issue was resolved, but may not have made it to 0.7.1  
**@jmeyer:** Here's one that doesn't (the id is a copy-paste from the data, so
no doubt it's valid)  
**@jmeyer:** @mayanks Thanks for the pointer, I'll check the API output  
**@jmeyer:** Output really is empty from the API  
**@jmeyer:**  
**@mayanks:** Hmm, the UI calls the api internally  
**@jmeyer:** Yes, and the API really returns no results for some of the ids  
**@mayanks:** I see, are you saying that both the API and the UI fail in the
same way?  
**@mayanks:** what's the query?  
**@jmeyer:** They do, I'm saying it doesn't look like a UI bug  
**@jmeyer:** (FYI I'm on version
`0.8.0-SNAPSHOT-46009e152b8f56c244e415beefa81dbc626de7cb`)  
**@jmeyer:** > what's the query?
NOK: `select * from test_communities where entityId = 47923728534576490`
OK: `select * from test_communities where entityId = 47923728366804330`  
**@jmeyer:** Really odd...  
**@jmeyer:** @mayanks If you don't mind, it may be easier if I can do a short
demo whenever you've got time  
**@ken:** What happens with `select * from test_communities where entityId =
'47923728534576490'`?  
**@jmeyer:** @ken Same result (as in, no change in behavior for both queries)  
**@mayanks:** Where did you get the id to query from? If the id was from the
UI, it might have been incorrect to begin with due to the UI issue  
**@jmeyer:** *More context:*
• Table contains 17 documents
• Table was created (definition, schema) using the API; data was fed via the `ingestFromFile` API endpoint (unit testing)  
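(For context, ad-hoc ingestion via the controller's `ingestFromFile` endpoint mentioned above generally looks something like the following. This is a hedged sketch: the table name, file name, and batch config are illustrative, not taken from this thread.)
```
# Push a small JSON file into an OFFLINE table via the controller's
# ingestFromFile endpoint. batchConfigMapStr here is the URL-encoded
# form of {"inputFormat":"json"}; adjust for your data format.
curl -X POST -F file=@rows.json \
  "localhost:9000/ingestFromFile?tableNameWithType=test_communities_OFFLINE&batchConfigMapStr=%7B%22inputFormat%22%3A%22json%22%7D"
```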
**@jmeyer:** > Where did you get the id to query from? If the id was from UI,
it might have been incorrect to begin with due to the UI issue
Indeed, I copy-pasted from the UI, which works for one ID but not the others.
Do you recommend another way?  
**@mayanks:** JavaScript truncates values in weird ways  
**@mayanks:** Can you get the id using a `select *` curl command and then try
to repro?  
**@jmeyer:** `curl localhost:8099/query/sql -d '{"sql": "select * from test_communities"}' | jq .`
```
{
  "resultTable": {
    "dataSchema": {
      "columnNames": ["communityId", "entityId", "entityType", "parentDepartmentId"],
      "columnDataTypes": ["STRING", "LONG", "STRING", "STRING"]
    },
    "rows": [
      ["604795e710fc330f65e9100e", 47923728366804330, "user", "null"],    <<< TEST ID.1
      ["604795e710fc330f65e9100e", 47923728534576490, "user", "null"],    <<< TEST ID.2
      ["604795e710fc330f65e9100c", 62535436887531140, "department", "de2ba60543be995f958e1e80"],
      ["604795e710fc330f65e9100c", 62535436887531080, "department", "de2ba60543be995f958e1e80"],
      ["604795e710fc330f65e9100c", 62535436954639944, "department", "de2ba60543be995f958e1e80"],
      ["604795e710fc330f65e9100c", 47923728685571440, "user", "de2ba60543be995f958e1e80"],
      ["604795e710fc330f65e9100c", 47923728803011950, "user", "de2ba60543be995f958e1e80"],
      ["604795e710fc330f65e9100c", 62535436887531080, "department", "de2ba60543be995f958e1e46"],
      ["604795e710fc330f65e9100c", 47923728685571440, "user", "de2ba60543be995f958e1e46"],
      ["604795e710fc330f65e9100c", 62535436954639944, "department", "de2ba60943be995f958e1e47"]
    ]
  },
  "exceptions": [],
  "numServersQueried": 1,
  "numServersResponded": 1,
  "numSegmentsQueried": 1,
  "numSegmentsProcessed": 1,
  "numSegmentsMatched": 1,
  "numConsumingSegmentsQueried": 0,
  "numDocsScanned": 10,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 40,
  "numGroupsLimitReached": false,
  "totalDocs": 17,
  "timeUsedMs": 4,
  "offlineThreadCpuTimeNs": 398035,
  "realtimeThreadCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0,
  "numRowsResultSet": 10
}
```
Now take the first returned ID (`entityId`) -> `47923728366804330`
*Query:* `curl localhost:8099/query/sql -d '{"sql": "select * from test_communities WHERE entityId = 47923728366804330"}' | jq .`
```
{
  "resultTable": {
    "dataSchema": {
      "columnNames": ["communityId", "entityId", "entityType", "parentDepartmentId"],
      "columnDataTypes": ["STRING", "LONG", "STRING", "STRING"]
    },
    "rows": [
      ["604795e710fc330f65e9100e", 47923728366804330, "user", "null"]
    ]
  },
  "exceptions": [],
  "numServersQueried": 1,
  "numServersResponded": 1,
  "numSegmentsQueried": 1,
  "numSegmentsProcessed": 1,
  "numSegmentsMatched": 1,
  "numConsumingSegmentsQueried": 0,
  "numDocsScanned": 1,
  "numEntriesScannedInFilter": 17,
  "numEntriesScannedPostFilter": 4,
  "numGroupsLimitReached": false,
  "totalDocs": 17,
  "timeUsedMs": 4,
  "offlineThreadCpuTimeNs": 342519,
  "realtimeThreadCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0,
  "numRowsResultSet": 1
}
```
-> OK
Now take the second ID -> `47923728534576490`
*Query:* `curl localhost:8099/query/sql -d '{"sql": "select * from test_communities WHERE entityId = 47923728534576490"}' | jq .`
```
{
  "resultTable": {
    "dataSchema": {
      "columnNames": ["communityId", "entityId", "entityType", "parentDepartmentId"],
      "columnDataTypes": ["STRING", "LONG", "STRING", "STRING"]
    },
    "rows": []
  },
  "exceptions": [],
  "numServersQueried": 1,
  "numServersResponded": 1,
  "numSegmentsQueried": 1,
  "numSegmentsProcessed": 1,
  "numSegmentsMatched": 0,
  "numConsumingSegmentsQueried": 0,
  "numDocsScanned": 0,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 0,
  "numGroupsLimitReached": false,
  "totalDocs": 17,
  "timeUsedMs": 10,
  "offlineThreadCpuTimeNs": 383400,
  "realtimeThreadCpuTimeNs": 0,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 0,
  "numRowsResultSet": 0
}
```
-> NOK  
**@jmeyer:** Pretty surprising :open_mouth:  
**@jmeyer:**  
**@jmeyer:** Logs don't show much but here they are  
**@jmeyer:**  
**@mayanks:** I missed it, where did you get `47923728366804330` and
`47923728534576490` from?  
**@jmeyer:** From the query results of the first query, column `entityId` (so,
the 2nd element in the arrays)  
**@jmeyer:** Here ->
```
"rows": [
  ["604795e710fc330f65e9100e", >>>>> 47923728366804330 <<<<<, "user", "null"],
```
**@jmeyer:** I can repro the issue after dropping the table, schema & data and
setting things back up. I'll try to restart the cluster and start it from
scratch using the latest Docker image. *=> Same behavior* :confused:  
**@mayanks:** Can you file an issue with steps to reproduce? I can get to it
later today/tomorrow  
**@jmeyer:** @mayanks Will do, okay  
**@mayanks:** Cc @jackie.jxt  
**@jackie.jxt:** @jmeyer Where did you get the json response? If it is from
the query console "showing json", then the value is already truncated
(JavaScript only preserves 53 precision bits)  
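For illustration, the 53-bit limit is easy to see from a shell (a minimal sketch; any JavaScript runtime shows the same behavior):
```
# IEEE-754 doubles (used by JavaScript, and by jq) keep 53 significand bits,
# so distinct integers above 2^53 can collapse to the same value:
node -e 'console.log(Number.MAX_SAFE_INTEGER)'                # 9007199254740991
node -e 'console.log(9007199254740992 === 9007199254740993)'  # true
```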
**@jackie.jxt:** Here is the issue:  It is not fixed yet  
**@jackie.jxt:** Can you try querying broker directly using curl:  
**@jackie.jxt:** It should give the accurate result  
**@jmeyer:** @jackie.jxt That's what I've tried too  
**@jmeyer:** See  
**@jackie.jxt:** Hmm, how about without `jq`?  
**@jmeyer:** It's only for formatting the result. Not sure I see then, which
query do you want me to try without it?  
**@jackie.jxt:**
```
jackie@MacBook-Pro ~ % echo '12345543211234554321' | jq .
12345543211234554000
```
**@jackie.jxt:** I think it also truncates the value  
**@jmeyer:** Ooooh interesting !  
**@jmeyer:** Let me try, nice catch  
**@jmeyer:** @jackie.jxt @mayanks @ken Sorry for the noise, @jackie.jxt is
right: `jq` has the same issue as Pinot's UI, altering the LONG values...  
**@jmeyer:** So, no issue, case solved :slightly_smiling_face:  
**@jmeyer:** Thanks again @jackie.jxt  
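For anyone hitting the same truncation: a JSON pretty-printer that parses integers exactly side-steps it. Python's json module keeps arbitrary-precision ints, so this is one safe alternative to `jq` (a sketch, not the only option):
```
# python3 parses JSON integers exactly, so large LONG values survive formatting:
curl localhost:8099/query/sql -d '{"sql": "select * from test_communities"}' \
  | python3 -m json.tool
```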
 **@elon.azoulay:** Hi, we are getting a 500 error (with no logs) when doing
queries with Integer.MAX_VALUE as the limit on Pinot 0.7.1 - I see from the
code that that should be the max limit. Is there something else we are running
into? (There are no table quotas.) Here is the full error, with hosts and
table names redacted to maintain super secrecy :) :
```
Unexpected response status: 500 for request {"sql" : "select AAA, BBB from CCC limit 2147483647" }
to url :XXXX/query/sql, with headers {Accept=[application/json], Content-Type=[application/json]},
full response null
```
**@elon.azoulay:** Should I create a github issue? I can also help and work on
this, lmk.  
**@elon.azoulay:** The controller has a stack trace but not the broker, and
there are no logs other than that the request was run (in pinotBroker.log)  
**@elon.azoulay:** Stack trace is basically that the post failed (from
controller). Also server does not have any logs (maybe request never makes it
that far)  
**@mayanks:** Is there an OOM?  
**@elon.azoulay:** I don’t see an OOM or anything in the logs. The main
consumer of memory (per jmap) was the async logger. Also, the table only has
4K rows. When I lower the limit, the query runs; then I raise it back and the
query fails. No crashes or OOMs.  
**@elon.azoulay:** Have you ever seen this? I am retrying in local docker with
debug logging (on a test table though) - maybe I can find debug messages…  
**@elon.azoulay:** Ok, more debug info, I see that it made it to the server:  
**@elon.azoulay:**  
**@elon.azoulay:** And broker parsed the query just before that (about 10ms
before):  
**@elon.azoulay:**  
**@elon.azoulay:** So it looks like something that happened after the broker
received the response from the server...  
**@elon.azoulay:** Server completed the response: ``` Processed
requestId=2,table=alltypes_REALTIME,segments(queried/processed/matched/consuming)=1/1/1/1,schedulerWaitMs=5,reqDeserMs=0,totalExecMs=4,resSerMs=0,totalTimeMs=12,minConsumingFreshnessMs=1625069569193,broker=Broker_f068e4137900_8099,numDocsScanned=11,scanInFilter=0,scanPostFilter=176,sched=fcfs```  
**@jackie.jxt:** I can reproduce the issue, let me look further into it  
**@elon.azoulay:** Thanks!  
**@elon.azoulay:** @jackie.jxt looks like it occurs after the
`reduceOnDataTable` call in
`SingleConnectionBrokerRequestHandler::processBrokerRequest`  
**@elon.azoulay:** Should I create a github issue for it?  
**@jackie.jxt:** @elon.azoulay Yes, please create an issue and we can post the
findings there  
**@elon.azoulay:** will do  
 **@nicolas.richard:** @nicolas.richard has joined the channel  
 **@frederic.vanharen:** @frederic.vanharen has joined the channel  

###  _#aggregators_

  
 **@leon.liu:** @leon.liu has joined the channel  
 **@leon.liu:** Hey, good morning. I read some articles about Pinot and feel
Pinot could be a great tool for our real-time analytics platform; we currently
use Snowflake and Redshift. I tried it with a simple use case (63 million
records with percentileest and avg aggregations) on a single EC2 instance and
the performance is amazing. I want to pursue it further and have a few
questions related to building a star-tree index for the aggregators; mainly,
we want to make sure building the star-tree indexes takes much less time than
the full cubing. Hope you can help me out:
1. For our percentile aggregation, we only care about the values at the 10th, 25th, 50th, 75th, and 90th percentiles. Is there any way to do the aggregation only for those percentiles?
2. How do I know if a star-tree index has been built? From the UI “Reload Status” screen, I don’t see anything related to the star-tree index.
3. Currently we do very intensive monthly cubing to support real-time analytics (percentile on 12 columns, avg on 12 columns, approx_count_distinct on 5 columns). At the end of each month we batch-feed about 70 million records. Is it possible to build the star-tree index in a couple of hours? If so, what are the recommended ways to speed up the index-building process?
Some context for our table:
1. 40 dimension columns, 1 time column, and 15 metric columns
2. We have a monthly feed of about 70 million records
3. We need monthly, quarterly, and yearly analytics
Thanks in advance  

###  _#pinot-dev_

  
 **@leon.liu:** @leon.liu has joined the channel  

###  _#pinot-docs_

  
 **@amrish.k.lal:** Hello, I am wondering how I can update the Pinot
documentation? I would like to make doc changes corresponding to the recent
JSON querying functionality () and maybe create a tutorial dealing with JSON
queries ().  
 **@mayanks:** @amrish.k.lal please join <#C023BNDT0N8|pinot-docsrus> and
check the header of the channel  
 **@mayanks:** @mayanks set the channel topic: This channel is moved to
<#C023BNDT0N8|pinot-docsrus>  

###  _#pinot-perf-tuning_

  
 **@leon.liu:** @leon.liu has joined the channel  
 **@leon.liu:** Hey, good morning. I read some articles about Pinot and feel
Pinot could be a great tool for our real-time analytics platform; we currently
use Snowflake and Redshift. I tried it with a simple use case (63 million
records with percentileest and avg aggregations) on a single EC2 instance and
the performance is amazing. I want to pursue it further and have a few
questions related to building a star-tree index for the aggregators; mainly,
we want to make sure building the star-tree indexes takes much less time than
the full cubing. Hope you can help me out:
1. For our percentile aggregation, we only care about the values at the 10th, 25th, 50th, 75th, and 90th percentiles. Is there any way to do the aggregation only for those percentiles?
2. How do I know if a star-tree index has been built? From the UI “Reload Status” screen, I don’t see anything related to the star-tree index.
3. Currently we do very intensive monthly cubing to support real-time analytics (percentile on 12 columns, avg on 12 columns, approx_count_distinct on 5 columns). At the end of each month we batch-feed about 70 million records. Is it possible to build the star-tree index in a couple of hours? If so, what are the recommended ways to speed up the index-building process?
Some context for our table:
1. 40 dimension columns, 1 time column, and 15 metric columns
2. We have a monthly feed of about 70 million records
3. We need monthly, quarterly, and yearly analytics
Thanks in advance  
**@mayanks:** Hello:
```
1. The TDigest-based percentile size remains the same regardless of which percentiles you want to query (unless there's a hidden feature that I am unaware of).
2. If you have access to the segment dir on the server, you can check the segment folder; there would be a star-tree index. But if you file an issue, we can expose it in some fashion.
3. Build time depends on data size and the configuration you specify, but it could be possible.
```
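On point 2, the on-disk check might look like this (a hedged sketch: paths and segment names are hypothetical, and the exact file layout depends on the segment format version):
```
# In the v3 segment format, a built star-tree shows up as dedicated files
# inside the segment directory on the server (paths below are illustrative):
ls /path/to/server/dataDir/myTable_OFFLINE/mySegment_0/v3/
# columns.psf  creation.meta  index_map  metadata.properties
# star_tree_index  star_tree_index_map   <- present once the star-tree is built
```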
**@mayanks:** What's your latency requirement? And have you tried without the
star-tree index to see if the requirement can be met?  
**@ken:** For Q3 - You can use the Spark or Hadoop (MapReduce) job runner to
build segments in parallel. If you configure your table to have indexes
generated when the segment is being built, you avoid some potential CPU/memory
bottlenecks when pushing these segments to the cluster. This all works well if
your Pinot cluster has access to HDFS (or some other shared filesystem), which
you can configure as your Pinot cluster’s deep store.  
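A rough sketch of what such a parallel, deep-store-backed build can look like (every name, path, and bucket URI below is hypothetical; the authoritative job-spec format is in the Pinot batch-ingestion docs):
```
# Hedged sketch: build segments from S3 input with the Spark job runner,
# then launch the job. All field values are illustrative, not from this thread.
cat > /tmp/sparkIngestionJobSpec.yaml <<'EOF'
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/monthly-feed/2021-06/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://my-bucket/pinot-segments/myTable/'
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-west-2'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
EOF
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /tmp/sparkIngestionJobSpec.yaml
```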
**@mayanks:** +1 to what @ken said ^^  
**@leon.liu:** Sub-second is our requirement. We are also client-facing with
high concurrency  
**@mayanks:** what's the read qps?  
**@leon.liu:** 200 qps is good enough for us  
**@leon.liu:** If we use a star-tree index, the query performance will be
better than what we are looking for; the only concern is how long it takes to
build the index. Right now it takes 12 hours in AWS to do the full cubing with
Spark.  
**@leon.liu:** We are using a lot of percentile aggregations in our queries. I
tried it without using any index, and one query returns in about 6s. I'm
relatively new here, not sure if there is any other way to make it much faster
and avoid heavy indexing.  
**@leon.liu:** @ken our data is in AWS S3. For the Q3 suggestion, is there any
example I can take a look at for reference? Or some more detailed
documentation would greatly help  
**@mayanks:** How many docs did the 6s query scan? If too many (>100K), then
star-tree is the right index  
**@leon.liu:** For the 6s query, it scans all of the docs I loaded (63
million)  
**@leon.liu:** If I load all of the data (36 months), the total docs will be
above 1 billion  
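For reference, a star-tree index is declared in the table config rather than the schema. A minimal fragment might look like the following (the dimension/metric names and the leaf threshold are illustrative assumptions, and the exact aggregation names allowed in `functionColumnPairs` should be checked against the docs):
```
"tableIndexConfig": {
  "starTreeIndexConfigs": [{
    "dimensionsSplitOrder": ["dim1", "dim2", "dim3"],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": ["AVG__metric1", "PERCENTILE_EST__metric2"],
    "maxLeafRecords": 10000
  }]
}
```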

###  _#getting-started_

  
 **@leon.liu:** @leon.liu has joined the channel  

###  _#pinot-docsrus_

  
 **@mark.needham:** I made a PR that puts a link to the multi-valued columns
docs next to where all the column types are described -  
**@mark.needham:** I thought there was gonna be some sort of list or array
type; I didn't realise it was a property that you could toggle  
 **@atri.sharma:** @atri.sharma has joined the channel  
 **@amrish.k.lal:** @amrish.k.lal has joined the channel  
 **@ken:** @ken has joined the channel  
 **@xiangfu0:** @xiangfu0 has joined the channel  
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org
For additional commands, e-mail: dev-help@pinot.apache.org