You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/03/19 02:00:16 UTC

Apache Pinot Daily Email Digest (2021-03-18)

### _#general_

  
 **@virtualandy:** @virtualandy has joined the channel  
 **@osman:** @osman has joined the channel  
 **@ravi.maddi:** Hi All, one basic doubt, I run quick start stream, I
understand the all the ports and components behind that, I am not able to
understand about *2191. what is running with 2191 port?*  
**@g.kishore:** Zookeeper. Please use <#C011C9JHN7R|troubleshooting> channel  
**@joshhighley:** Will upsert work with hybrid tables? Will a realtime record
become active over an offline record having the same primary key value?  
**@mayanks:** Currently, upsert support is limited to real-time tables only.  
**@jackie.jxt:** No. The plan is to support uploading segments to real-time
table, and it is a work in progress  

###  _#random_

  
 **@virtualandy:** @virtualandy has joined the channel  
 **@osman:** @osman has joined the channel  

###  _#feat-text-search_

  
 **@brianolsen87:** @brianolsen87 has joined the channel  

###  _#feat-presto-connector_

  
 **@brianolsen87:** @brianolsen87 has joined the channel  

###  _#troubleshooting_

  
 **@phuchdh:** hi team. Is there anyway to check minion task:
`RealtimeToOfflineSegmentsTask` status or error message ? I’m find warn log in
brokers-server. But cannot find any log information in minion-server. Is this
log relations to this task ? ```2021/03/17 09:26:49.639 WARN
[TimeBoundaryManager] [HelixTaskExecutor-message_handle_thread] Failed to find
segment with valid end time for table: RuleLogsUAT_OFFLINE, no time boundary
generated 2021/03/17 09:27:06.989 WARN [BaseBrokerRequestHandler] [jersey-
server-managed-async-executor-0] Failed to find time boundary info for hybrid
table: RuleLogsUAT```  
**@npawar:** The messages will be either in the controller or minion log.
Broker messages will not be related to the minion task  
**@mayanks:** The log you posted above happens when the endTime in the segment
zk metadata is <= 0:  
**@mayanks:**  
**@phuchdh:** I’m found the errors logs in controller-logs ```2021/03/17
10:34:53.192 WARN [ZkClient] [TaskJobPurgeWorker-pinot-quickstart] Failed to
delete path /pinot-
quickstart/PROPERTYSTORE/TaskRebalancer/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349!
org.I0Itec.zkclient.exception.ZkException:
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode =
Directory not empty for /pinot-
quickstart/PROPERTYSTORE/TaskRebalancer/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
2021/03/17 10:35:07.504 ERROR [JobDispatcher] [HelixController-pipeline-task-
pinot-quickstart-(a699ebbf_TASK)] Job configuration is NULL for
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
2021/03/17 10:35:07.517 ERROR [TaskUtil] [TaskJobPurgeWorker-pinot-quickstart]
Job
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
exists in JobDAG but JobConfig is missing! Job might have been deleted
manually from the JobQueue: TaskQueue_RealtimeToOfflineSegmentsTask, or left
in the DAG due to a failed clean-up attempt from last purge. 2021/03/17
10:35:07.607 ERROR [JobDispatcher] [HelixController-pipeline-task-pinot-
quickstart-(fa8a46d1_TASK)] Job configuration is NULL for
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
2021/03/17 10:35:07.796 ERROR [JobDispatcher] [HelixController-pipeline-task-
pinot-quickstart-(329800c1)] Job configuration is NULL for
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
2021/03/17 10:35:07.878 ERROR [JobDispatcher] [HelixController-pipeline-task-
pinot-quickstart-(6e9abcf3_TASK)] Job configuration is NULL for
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The
configuration 'realtime.segment.flush.threshold.rows' was supplied but isn't a
known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8]
The configuration 'stream.kafka.consumer.prop.group.id' was supplied but isn't
a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig]
[pool-8-thread-8] The configuration 'stream.kafka.decoder.class.name' was
supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN
[ConsumerConfig] [pool-8-thread-8] The configuration 'streamType' was supplied
but isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig]
[pool-8-thread-8] The configuration 'realtime.segment.flush.segment.size' was
supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN
[ConsumerConfig] [pool-8-thread-8] The configuration
'stream.kafka.consumer.type' was supplied but isn't a known config. 2021/03/17
10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration
'stream.kafka.broker.list' was supplied but isn't a known config. 2021/03/17
10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The configuration
'realtime.segment.flush.threshold.time' was supplied but isn't a known config.
2021/03/17 10:37:53.084 WARN [ConsumerConfig] [pool-8-thread-8] The
configuration 'stream.kafka.consumer.prop.auto.offset.reset' was supplied but
isn't a known config. 2021/03/17 10:37:53.084 WARN [ConsumerConfig]
[pool-8-thread-8] The configuration 'stream.kafka.consumer.factory.class.name'
was supplied but isn't a known config. 2021/03/17 10:37:53.084 WARN
[ConsumerConfig] [pool-8-thread-8] The configuration 'stream.kafka.topic.name'
was supplied but isn't a known config. 2021/03/17 11:04:53.201 ERROR
[JobDispatcher] [HelixController-pipeline-task-pinot-quickstart-(4edd07a9)]
Job configuration is NULL for
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
2021/03/17 11:30:46.186 ERROR [JobDispatcher] [HelixController-pipeline-task-
pinot-quickstart-(597f559b_TASK)] Job configuration is NULL for
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
2021/03/17 11:30:46.361 WARN [ZkClient] [TaskJobPurgeWorker-pinot-quickstart]
Failed to delete path /pinot-
quickstart/PROPERTYSTORE/TaskRebalancer/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615891980352!
org.I0Itec.zkclient.exception.ZkException:
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode =
Directory not empty for /pinot-
quickstart/PROPERTYSTORE/TaskRebalancer/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615891980352
2021/03/17 11:30:46.406 WARN [ZkBaseDataAccessor] [HelixController-pipeline-
task-pinot-quickstart-(27f5421b_TASK)] Fail to read record for paths: {/pinot-
quickstart/INSTANCES/Server_pinot-server-0.pinot-server-
headless.analytics.svc.cluster.local_8098/MESSAGES/49538041-4770-474f-b8b3-a414e638244f=-101}
2021/03/17 11:30:46.486 ERROR [JobDispatcher] [HelixController-pipeline-task-
pinot-quickstart-(27f5421b_TASK)] Job configuration is NULL for
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
2021/03/17 11:30:46.515 ERROR [TaskUtil] [TaskJobPurgeWorker-pinot-quickstart]
Job
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615891980352
exists in JobDAG but JobConfig is missing! Job might have been deleted
manually from the JobQueue: TaskQueue_RealtimeToOfflineSegmentsTask, or left
in the DAG due to a failed clean-up attempt from last purge. 2021/03/17
11:30:53.066 WARN [TopStateHandoffReportStage] [HelixController-pipeline-
default-pinot-quickstart-(aafda2de_DEFAULT)] Event aafda2de_DEFAULT : Cannot
confirm top state missing start time. Use the current system time as the start
time. 2021/03/17 11:30:53.159 WARN [TopStateHandoffReportStage]
[HelixController-pipeline-default-pinot-quickstart-(50b48792_DEFAULT)] Event
50b48792_DEFAULT : Cannot confirm top state missing start time. Use the
current system time as the start time. 2021/03/17 11:32:55.353 WARN
[SegmentCompletionFSM_RuleLogs__0__0__20210316T1130Z] [grizzly-http-server-0]
COMMITTER_NOTIFIED:Aborting FSM (too late) instance=Server_pinot-
server-2.pinot-server-headless.analytics.svc.cluster.local_8098 offset=17939
now=1615980775353 start=1615980646028```  
**@phuchdh:** my Realtime Table Config ```{ "REALTIME": { "tableName":
"RuleLogs_REALTIME", "tableType": "REALTIME", "segmentsConfig": {
"replication": "2", "replicasPerPartition": "2", "timeColumnName":
"created_at_days_epoch", "schemaName": "RuleLogs" }, "tenants": { "broker":
"DefaultTenant", "server": "DefaultTenant", "tagOverrideConfig": {} },
"tableIndexConfig": { "sortedColumn": [ "campaign_id", "rule_id" ],
"loadMode": "MMAP", "invertedIndexColumns": [ "user_id", "device_id" ],
"autoGeneratedInvertedIndex": false,
"createInvertedIndexDuringSegmentGeneration": false, "streamConfigs": {
"streamType": "kafka", "stream.kafka.topic.name": "xxx",
"stream.kafka.broker.list": "confluent-cp-kafka-
headless.kafka.svc.cluster.local:9092", "stream.kafka.consumer.prop.group.id":
"c1.promotion", "stream.kafka.consumer.type": "lowLevel",
"stream.kafka.consumer.prop.auto.offset.reset": "smallest",
"stream.kafka.consumer.factory.class.name":
"org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
"stream.kafka.decoder.class.name":
"org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
"realtime.segment.flush.threshold.rows": "0",
"realtime.segment.flush.threshold.time": "24h",
"realtime.segment.flush.segment.size": "200M" }, "enableDefaultStarTree":
false, "enableDynamicStarTreeCreation": false, "aggregateMetrics": false,
"nullHandlingEnabled": false }, "metadata": { "customConfigs": {} }, "quota":
{}, "task": { "taskTypeConfigsMap": { "RealtimeToOfflineSegmentsTask": {
"collectorType": "concat", "bucketTimePeriod": "1d", "bufferTimePeriod": "1h",
"maxNumRecordsPerSegment": "1000000" } } }, "routing": {}, "query": {},
"ingestionConfig": { "transformConfigs": [ { "columnName": "device_id",
"transformFunction": "jsonPathString(extra_data,
'$.user_attributes.audience.device_id')" }, { "columnName": "service_code",
"transformFunction": "jsonPathString(extra_data, '$.service_code')" }, {
"columnName": "event", "transformFunction": "jsonPathString(extra_data,
'$.event')" }, { "columnName": "created_at_days_epoch", "transformFunction":
"toEpochDays(created_at_ts)" } ] }, "isDimTable": false } }```  
**@npawar:** @jackie.jxt any idea regarding this error? It's coming from the
task framework I think?  
**@jackie.jxt:** Yes, the error is from Helix  
**@jackie.jxt:** @phuchdh Did you delete any task through controller API?
```2021/03/17 10:35:07.517 ERROR [TaskUtil] [TaskJobPurgeWorker-pinot-
quickstart] Job
TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1615888367349
exists in JobDAG but JobConfig is missing! Job might have been deleted
manually from the JobQueue: TaskQueue_RealtimeToOfflineSegmentsTask, or left
in the DAG due to a failed clean-up attempt from last purge.```  
**@jackie.jxt:** Also, the `WARN` from broker means either the offline segment
has invalid end time, or there is no offline segment. Can you check if any
segment is pushed to the offline table?  
**@phuchdh:** i don’t delete any task through controller API.  
**@phuchdh:** In my scenario. I config my realtime job segments create in 1
days, then i except config segment task convert realtime table to offline
table.  
**@phuchdh:** here is some segments yesterday & today.  
**@phuchdh:** but when i check segments created yesterday, it’s status is
OFFLINE. So i think i have problem with realtime segments to offline segments
task  
**@phuchdh:**  
**@phuchdh:** but another partition was mark by status ONLINE  
 **@ali:** @ali has left the channel  
 **@virtualandy:** @virtualandy has joined the channel  
 **@osman:** @osman has joined the channel  
 **@ravi.maddi:** ** I am trying to push to kafka, I am not able to get any
thing as response. And data not appearing in the consumer console also.
*Please help me* how to trace it , where can I found logs. is there any
*options to add to my command* to see more detailed log. I am running this
command: ```bin/kafka-console-producer.sh --broker-list localhost:19092
--topic mytopic opt_flatten_json.json``` I am getting output this only:
```>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>```  
**@fx19880617:** for console producer, you need to type the data then press
ENTER to send one message  
**@fx19880617:** Please read this one:  
**@fx19880617:** also this:  
**@pabraham.usa:** Hello, I have a JSON data `log` and want to extract values
based on keys (`urlpath`). So tried to use JSONIndex however fails during
parsing. So ingested it as normal string and tried
JSONEXTRACTSCALAR/*json_extract_scalar* however this also fails during
parsing. Finally I ended up using Groovy function like `GROOVY('{"returnType":
"STRING", "isSingleValue": true}', 'java.util.regex.Pattern p =
java.util.regex.Pattern.compile("(\"urlpath\":\")((?:\\\\\"|[^\\\\\"]*))");
java.util.regex.Matcher m = p.matcher(arg0); if(m.find()){ return m.group(2);
} else { return "";}',log)` and this works in SQL. Now I want to add this
Groovy function inside table config to do ingestionTransform to define a new
columnName. Is this possible? For ingestion transform can we do multi line ,
semi colon separated script?  

###  _#pinot-dev_

  
 **@brianolsen87:** @brianolsen87 has joined the channel  
 **@ken:** If I need to determine the number of groups from an aggregation
query, where the groups are filtered by aggregation result, are there any
recommended approaches? E.g. group by minute, sum page views, and I only care
about minutes with > 1000 page views - what’s a good way to determine the
number of interesting minutes? Assume there can be many (e.g. > 1M groups) for
my specific use case, so I can’t do an order by with some large limit.  
**@mayanks:** Are you referring to the `Having` clause? If so, that is
supported.  
**@ken:** I can use `having <filter>` to restrict groups, but is there a way
to count the number of groups without returning the groups (and using some
arbitrarily huge limit)?  
**@ken:** (side note - what page is `having` documented on?  
**@mayanks:** `Having` follows SQL syntax so might not be explicitly
documented. But that brings up a good point, perhaps we should catalogue what
is supported and what isn't (from SQL).  
**@mayanks:** I am unsure if there's a better way to do what you want other
than `Having`. Perhaps the `Having` implementation can be optimized to do
filtering on server side (pre combine) if the query allows for it (for example
if a monotonically increasing aggr function has a filter).  
**@ken:** If the data is segmented by the group key then I would imagine
server-side filtering is a possible optimization, otherwise I think you don’t
know whether the aggregation results of the gather phase would pass the filter
until all results have been combined from all servers. And that means an
unbounded amount of memory for the priority map (or whatever is used to
collect the results).  
**@mayanks:** I was referring to cases like count(*) or sum (with +ve numbers)
and filters like xxx > yyy where you can safely do filtering in scatter phase.
Agreed though, you can't always do that.  

###  _#getting-started_

  
 **@virtualandy:** @virtualandy has joined the channel  

###  _#segment-write-api_

  
 **@npawar:** should we have another meeting tomorrow?  
 **@yupeng:** ok for me  
 **@chinmay.cerebro:** :thumbsup:  
 **@yupeng:** will 3:30-4 work for you @npawar  
 **@npawar:** yes works  
 **@fx19880617:**  
\--------------------------------------------------------------------- To
unsubscribe, e-mail: dev-unsubscribe@pinot.apache.org For additional commands,
e-mail: dev-help@pinot.apache.org