Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/05/17 03:01:18 UTC

Apache Pinot Daily Email Digest (2022-05-16)

### _#general_

  
 **@hello472:** @hello472 has joined the channel  
 **@ysuo:** Hi team, how do we check whether an instanceAssignmentConfigMap
config has taken effect?  
**@mayanks:** One somewhat roundabout way of doing so would be to do a dry-run
of rebalance and see if it shows any changes in ideal-state. If not, then it
has taken effect.  
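A dry-run rebalance can be issued against the controller REST API; a minimal sketch, assuming a controller at localhost:9000 and a table named myTable (both illustrative): ```
# dry-run reports what the ideal-state would become, without applying changes
curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=OFFLINE&dryRun=true" \
  -H "accept: application/json"
```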
 **@harish.bohara:** I have 2-3 fields in metricFieldSpecs. These columns
capture the time taken to do some operation (e.g. time taken from send ->
delivery of an item). Any idea how to get a histogram of this data (to be used
in Superset)?  
**@mayanks:** Afaik, there isn’t an inbuilt histogram function. You could use
percentileTDigest to get fast percentiles for a histogram. cc: @jackie.jxt
@kharekartik  
 **@saurabhkumarsharma96:** @saurabhkumarsharma96 has joined the channel  
 **@email2sandhu01:** @email2sandhu01 has joined the channel  
 **@karinwolok1:** If anyone wants to submit a talk:  
**@maarten:** @maarten has joined the channel  

###  _#random_

  
 **@hello472:** @hello472 has joined the channel  
 **@saurabhkumarsharma96:** @saurabhkumarsharma96 has joined the channel  
 **@email2sandhu01:** @email2sandhu01 has joined the channel  
 **@maarten:** @maarten has joined the channel  

###  _#troubleshooting_

  
 **@wcxzjtz:** hello, wondering how I can check if a query is using a
rangeIndex. I added the index config like the following, but from the tracing
info I didn’t see `RangeIndexBasedFilterOperator` being used. ```
"rangeIndexColumns": [ "some_column" ],``` BTW, we are using Pinot 0.8  
**@wcxzjtz:** actually, I see it now, thanks. But it looks like it only works
for offline tables?  
**@wcxzjtz:** @richard892 when you have time.  
**@wcxzjtz:** hold on, there may be some issue with my data.  
 **@xiangfu0:** better check with @richard892, likely that thing is not
enabled.  
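For reference, per-query tracing can be requested through the broker to see which filter operators ran; a sketch, assuming a broker at localhost:8099 (host, table, and column are illustrative): ```
# "trace": true returns operator-level info; a range-indexed filter
# should surface as RangeIndexBasedFilterOperator in the trace
curl -X POST "http://localhost:8099/query/sql" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT COUNT(*) FROM myTable WHERE some_column > 100", "trace": true}'
```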
 **@hello472:** @hello472 has joined the channel  
 **@ysuo:** Hi, we deployed presto based on the helm file. It seems like
offset is not enabled. Any idea how to enable offset?  
**@kharekartik:** @haitao  
**@haitao:** @xiangfu0 has more knowledge about the helm chart  
**@xiangfu0:** what is the offset you are referring to?  
**@ysuo:** Hi, when using offset num1 limit num2 in a Presto query, it
returned “Offset support is not enabled”: ```
presto:default> select * from table_name offset 10 limit 10;
Query 20220516_231652_00449_7gh97 failed: Offset support is not enabled
```  
**@xiangfu0:** oh, use limit 10,10  
**@xiangfu0:** also pinot doesn’t support offset without ordering  
**@xiangfu0:** so just select * won’t give you consistent results  
**@xiangfu0:** why do you need offset 10?  
**@ysuo:** Hi, it’s the Presto query that returned “offset not enabled”. We
need offset for pagination.  
**@xiangfu0:** Yeah, check presto query syntax  
**@xiangfu0:** Note that pagination is not enabled  
**@xiangfu0:** So the results are not stable  
**@xiangfu0:** Better to fetch enough rows and cache them on the front-end  
**@ysuo:** :ok_hand:  
**@ysuo:** Thanks.  
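To illustrate the syntax suggested above: Pinot accepts a MySQL-style LIMIT offset, count, and pairing it with ORDER BY keeps pages stable (column name illustrative): ```
-- second page of 10 rows; ORDER BY makes page boundaries deterministic
SELECT * FROM table_name
ORDER BY created_at_ms
LIMIT 10, 10
```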
 **@dadelcas:** Hello, I've got an issue with a realtime table which is
consuming from a topic with 16 partitions. Pinot is consuming from all
partitions except one, and I can't find issues in the logs. Is there a way to
force Pinot to consume from that partition? I've tried rebalancing the servers
and reloading all segments, but it still won't consume from this one partition  
**@saurabhd336:** Can you check if there are CONSUMING segments for all your
partitions in ZK? You can use the controller UI to check that. Here's an
example. It's under IDEALSTATES -> <tableName>_REALTIME. You should ideally
have one segment per partition with state as "CONSUMING". If there are some
partitions for which you don't have consuming segments, you might have to
manually create them to resume consumption.  
**@saurabhd336:** The key in mapFields is of the format
`<tableName>__<partitionGroupId>__<sequenceNumber>__<dateTime>`  
**@kharekartik:** Also, you can trigger the `RealtimeSegmentValidation` task
to detect new partitions. This can be done via an API call to the controller:
`curl -X GET "" -H "accept: application/json"`  
**@dadelcas:** I had checked ideal states to confirm the table wasn't
consuming, there aren't entries in ZK for this partition. I would rather avoid
doing operations at this level  
**@kharekartik:** @navi.trinity can you help here. What can be the cause of
only one partition not showing up?  
**@dadelcas:** I've run the segment validation task but still no luck  
**@saurabhd336:** Was any segment delete command run for this partition's
segment @dadelcas?  
**@saurabhd336:** Or are there any segments in OFFLINE state?  
**@dadelcas:** There are no segments for this partition, nor have they been
deleted  
 **@saurabhkumarsharma96:** @saurabhkumarsharma96 has joined the channel  
 **@nair.a:** Hi Team, regarding a lookup/dimension table and array data type
use case. We have created a dimension table with the following schema: ```
{
  "schemaName": "test_dim_tags",
  "dimensionFieldSpecs": [
    { "name": "id", "dataType": "INT" },
    { "name": "tag_name", "dataType": "STRING", "singleValueField": false }
  ],
  "primaryKeyColumns": [ "id" ]
}
``` Now when we use this table in a lookup with the fact table, the query
returns no data or throws a NullPointerException. We wanted to use Pinot's
array explode functionality along with lookup. Can someone please help us
understand?  
**@richard892:** I believe this is a feature gap in lookup  
**@richard892:** I'll take a look and see if there are barriers to adding it  
**@nair.a:** sure thanks @richard892  
**@richard892:** featurewise it looks good, do you have a stack trace for the
NPE?  
**@nair.a:** ```[ { "message":
"QueryExecutionError:\nProcessingException(errorCode:450,
message:InternalError:\njava.lang.NullPointerException\n\tat
org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:236)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)",
"errorCode": 200 }, { "message":
"QueryExecutionError:\nProcessingException(errorCode:450,
message:InternalError:\njava.lang.NullPointerException\n\tat
org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:242)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)",
"errorCode": 200 }, { "message":
"QueryExecutionError:\nProcessingException(errorCode:450,
message:InternalError:\njava.lang.NullPointerException\n\tat
org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:236)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)",
"errorCode": 200 }, { "message":
"QueryExecutionError:\nProcessingException(errorCode:450,
message:InternalError:\njava.lang.NullPointerException\n\tat
org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:236)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat
org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)",
"errorCode": 200 } ]```  
**@richard892:** ok, this is most likely caused by the query being slow  
**@richard892:** are these lookups in unfiltered group bys?  
**@nair.a:** we had a few filter conditions, if that's what you are asking
about.  
**@richard892:** can you remove the lookup from the query and post the
response metadata (numDocsScanned etc.) please?  
**@nair.a:** ```"exceptions": [], "numServersQueried": 12,
"numServersResponded": 12, "numSegmentsQueried": 569, "numSegmentsProcessed":
32, "numSegmentsMatched": 32, "numConsumingSegmentsQueried": 4,
"numDocsScanned": 37273560, "numEntriesScannedInFilter": 88491445,
"numEntriesScannedPostFilter": 260914920, "numGroupsLimitReached": false,
"totalDocs": 5011102229, "timeUsedMs": 595, "offlineThreadCpuTimeNs": 0,
"realtimeThreadCpuTimeNs": 0, "offlineSystemActivitiesCpuTimeNs": 0,
"realtimeSystemActivitiesCpuTimeNs": 0,
"offlineResponseSerializationCpuTimeNs": 0,
"realtimeResponseSerializationCpuTimeNs": 0, "offlineTotalCpuTimeNs": 0,
"realtimeTotalCpuTimeNs": 0, "segmentStatistics": [], "traceInfo": {},
"minConsumingFreshnessTimeMs": 1652704731377, "numRowsResultSet": 350```  
**@richard892:** ok so it's quite a heavy query, and then the lookup will make
that worse because the approach employed is not very efficient, which makes
timeout rather than feature incompleteness a more likely diagnosis  
**@richard892:** all I can say is lookup isn't powerful enough to power
anything but the simplest and lightest-weight join use cases, but the
multi-stage query engine will solve problems like this one  
**@nair.a:** that's great, looking forward to it.  
**@nair.a:** @richard892 one more thing with the dimension table: lookups
start to return null after some time, and we have to rerun the ingestion job
to fix this. Any known reason?  
 **@email2sandhu01:** @email2sandhu01 has joined the channel  
 **@maarten:** @maarten has joined the channel  
 **@stuart.millholland:** So I've set up my controller/minions/servers to use
a GCS bucket in a GKE environment. Is there an easy-button way to test that
the GCS bucket perms and such are working correctly? I don't have any data
yet, so I'm curious if there's a way to test that things are working.  
**@mayanks:** Check controller/server logs to see how PinotFs is initialized.  
**@stuart.millholland:** logs don't have any complaints  
**@mayanks:** Do the logs contain something like: `Initializing PinotFS for
scheme` for the right deep-store (GCS)?  
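For reference, a minimal sketch of the controller-side GCS deep-store wiring those log lines correspond to, with key names as documented for the Pinot GCS plugin and the bucket, project, and key path as placeholders: ```
# controller.conf (sketch)
controller.data.dir=gs://my-pinot-bucket/pinot-data
controller.local.temp.dir=/tmp/pinot-tmp-data
pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
pinot.controller.storage.factory.gs.projectId=my-gcp-project
pinot.controller.storage.factory.gs.gcpKey=/path/to/service-account-key.json
pinot.controller.segment.fetcher.protocols=file,http,gs
pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```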

###  _#getting-started_

  
 **@hello472:** @hello472 has joined the channel  
 **@filipdolinski:** Hi all,  
 **@filipdolinski:** I am looking for a Spark connector for writing data to
Pinot. I saw on GitHub that write support will be available in the future. Do
you have any news about it, or tips on how to deal with this? Thank you in
advance!  
**@kharekartik:** we currently don't support writing Spark DataFrames/RDDs
directly to Pinot. However, you can use our Spark plugin to read your data and
dump it into Pinot's storage. You can find the documentation here -  
**@kharekartik:** Example recipe -  
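For concreteness, the batch route typically runs Pinot's ingestion job under Spark via spark-submit; a sketch, with the jar and job-spec paths as placeholders: ```
# run Pinot's batch ingestion job on Spark; the job spec describes input
# location, table, and segment push settings
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" \
  --deploy-mode client \
  /path/to/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  -jobSpecFile /path/to/sparkIngestionJobSpec.yaml
```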
**@saurabhkumarsharma96:** @saurabhkumarsharma96 has joined the channel  
 **@email2sandhu01:** @email2sandhu01 has joined the channel  
 **@maarten:** @maarten has joined the channel  
 **@rbobbala:** Hello Team, I'm new to Apache Pinot. I have set up my Apache
Pinot cluster on my local laptop using KinD and Helm. My question is: what is
the best way to automate the upload of new schema, table, and job (realtime &
batch ingestion) files to Pinot?  
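As one possible starting point, the controller REST endpoints for schemas and tables can be scripted or wrapped in CI; a sketch, assuming a controller at localhost:9000 and local JSON files (file names illustrative): ```
# add or replace a schema
curl -F schemaName=@my_schema.json "http://localhost:9000/schemas?override=true"

# create a table from its config
curl -X POST -H "Content-Type: application/json" \
  -d @my_table.json "http://localhost:9000/tables"
```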

###  _#introductions_

  
 **@hello472:** @hello472 has joined the channel  
 **@saurabhkumarsharma96:** @saurabhkumarsharma96 has joined the channel  
 **@email2sandhu01:** @email2sandhu01 has joined the channel  
 **@maarten:** @maarten has joined the channel  