Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/03/24 02:00:25 UTC

Apache Pinot Daily Email Digest (2022-03-23)

### _#general_

  
 **@paularmand.verhaegen:** Hi all, I'm Paul-Armand Verhaegen. I'm the Data
Domain and Data Specialty Architect for a News Publisher (we operate in a
couple of European countries). Interested in basically anything with data
(science and engineering), electronics, making stuff, math, hard problems,
crypto tech, also organisational things related to data mesh. Here to learn
about Pinot, what it can do for us, and which datasketches are useful in our
RT dashboards.  
**@mark.needham:** for sketches, have a look at a presentation Mayank did
where he shows examples of where they work well -  
**@paularmand.verhaegen:** Thanks. Watched it, it was indeed an interesting
presentation.  
 **@prashant.pandey:** Hi team :slightly_smiling_face:. We have a use-case
where we’d like to coalesce small segments to larger ones. However, it’s a
realtime table and we use RT2OFF to move segments to offline servers
periodically. Is it possible to use the minion merge rollup task to merge the
segments residing on OFFLINE servers (although the docs explicitly mention
that it only supports OFFLINE tables)? Thanks :slightly_smiling_face:  
**@mark.needham:** Yeh - you can use RT2OFF to move segments to offline and
then Merge Rollup to rollup the offline segments. An example showing how to do
the merge roll up here -  Example of RT2OFF here -  
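For reference, a minimal sketch of that two-step setup, assuming minions are running and the controller's task scheduler is enabled; the periods and merge type below are illustrative, not a recommendation. In the REALTIME table config (moves completed segments into the paired OFFLINE table):

```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "2d"
    }
  }
}
```

and in the OFFLINE table config (merges/rolls up the moved segments):

```
"task": {
  "taskTypeConfigsMap": {
    "MergeRollupTask": {
      "1day.mergeType": "concat",
      "1day.bucketTimePeriod": "1d",
      "1day.bufferTimePeriod": "1d"
    }
  }
}
```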
**@prashant.pandey:** Thanks @mark.needham  
 **@asti.manuka:** @asti.manuka has joined the channel  
 **@golthiryus:** @golthiryus has joined the channel  
 **@krishna.gourishetti:** @krishna.gourishetti has joined the channel  

###  _#random_

  
 **@asti.manuka:** @asti.manuka has joined the channel  
 **@golthiryus:** @golthiryus has joined the channel  
 **@krishna.gourishetti:** @krishna.gourishetti has joined the channel  

###  _#troubleshooting_

  
 **@weixiang.sun:** When I try to use the lookup UDF join between a
dimension table and a realtime table, it does not work. But it works between a
dimension table and an offline table. Is this expected? I do not see such a
restriction from . Is there anything missing?  
**@weixiang.sun:** @lakshmanan.velusamy  
**@lakshmanan.velusamy:** Getting the following exception: ```{ "errorCode":
200, "message":
"QueryExecutionError:\norg.apache.pinot.spi.exception.BadQueryRequestException:
Caught exception while initializing transform function: lookup\n\tat
org.apache.pinot.core.operator.transform.function.TransformFunctionFactory.get(TransformFunctionFactory.java:207)\n\tat
org.apache.pinot.core.operator.transform.TransformOperator.<init>(TransformOperator.java:56)\n\tat
org.apache.pinot.core.plan.TransformPlanNode.run(TransformPlanNode.java:56)\n\tat
org.apache.pinot.core.plan.SelectionPlanNode.run(SelectionPlanNode.java:83)\n\tat
org.apache.pinot.core.plan.CombinePlanNode$1.callJob(CombinePlanNode.java:133)\n\tat
org.apache.pinot.core.plan.CombinePlanNode$1.callJob(CombinePlanNode.java:119)\n\tat
org.apache.pinot.core.util.trace.TraceCallable.call(TraceCallable.java:44)\n\tat
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat
shaded.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)\n\tat
shaded.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)\n\tat
shaded.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)\n\tat
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)"
},``` Query: `select LOOKUP(a, b, c, d) from table_REALTIME limit 10`  
**@weixiang.sun:** @yupeng can you help us?  
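For context, the LOOKUP function takes the dimension table name, the column to fetch, and then join-key/value pairs. A minimal sketch with hypothetical table and column names (dimCustomers, customerName, customerId), in the offline-table form that reportedly works:

```
SELECT orderId,
       LOOKUP('dimCustomers', 'customerName', 'customerId', customerId) AS customerName
FROM orders_OFFLINE
LIMIT 10
```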
 **@ryantle1028:** hello everyone, I found an error with the transform function
jsonPathString: I cannot use the word order in jsonPathString -->
"transformFunction": "jsonPathString(order,'$.channel')" --> this does not work.
As a test I modified the JSON, replacing order with hello, and used this -->
"transformFunction": "jsonPathString(hello,'$.channel')" and it works. Why can I
not use "order"? My real JSON messages use "order". Please help.  
**@ryantle1028:** Invalid transform function
'jsonPathString(order,'$.channel')' for column 'channel' exception: Invalid
transform function 'jsonPathString(order,'$.channel')' for column 'channel'
Handled request from 172.23.188.107 POST , content-type application/json
status code 400 Bad Request  
**@mark.needham:** without looking into it any further yet, it might be b/c
'order' is a reserved word in sql and this is likely going through a sql
parser...but maybe you can quote the word order and see if that works:
```"transformFunction": "jsonPathString(\"order\",'$.channel')"```  
**@ryantle1028:** oh it works, thank you very much @mark.needham  
**@ryantle1028:** @mark.needham "submissionDate" :
"2022-03-15T17:31:44.540+0700", how can I use a date format with this? I tried
this but it does not work: `"dateTimeFieldSpecs": [ { "name": "submissionDate",
"dataType": "TIMESTAMP", "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSZ",
"granularity": "1:MILLISECONDS" } ]`  
**@mark.needham:** you'll need to use a transformation function for that as
well. You can see an example here -  
**@ryantle1028:** thank you.  
**@mark.needham:** you'll have to use a different column name than what is in
the source data btw  
**@mark.needham:** or you'll get an error  
**@ryantle1028:** In my JSON the data type is a string: "submissionDate" :
"2022-03-15T17:31:44.540+0700". I tested creating the table; the table can be
created but I don't see any data. The schema structure I tested is:
```
{
  "schemaName": "omx_order_20",
  "dimensionFieldSpecs": [
    { "name": "channel", "dataType": "STRING" },
    { "name": "orderId", "dataType": "STRING" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "submissionDate", "dataType": "STRING",
      "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSZ",
      "granularity": "1:MILLISECONDS" }
  ]
}
```  
**@ryantle1028:** i will check your link and test, thank you for answer
reserved word.  
**@mark.needham:** oh as a string it should work. But if it's a string I don't
think you'd be able to do proper data operations (in case you wanted to do
that)  
**@ryantle1028:** Normally, for a realtime table, do we use the date from the
system or from inside the JSON message?  
**@mark.needham:** unless you have a datetime column in your schema/table
there won't be one, so you do need to specify the date yourself (e.g. via the
JSON message)  
**@ryantle1028:** I tried to create the table. The table can be created, but no
data is found.
```
{
  "tableName": "omx_order_20",
  "tableType": "REALTIME",
  "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant", "tagOverrideConfig": {} },
  "segmentsConfig": {
    "schemaName": "omx_order_20",
    "timeColumnName": "submissionDate",
    "replication": "1",
    "replicasPerPartition": "1",
    "retentionTimeUnit": null,
    "retentionTimeValue": null,
    "completionConfig": null,
    "crypterClassName": null,
    "peerSegmentDownloadScheme": null
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": [],
    "createInvertedIndexDuringSegmentGeneration": false,
    "rangeIndexColumns": [],
    "sortedColumn": [],
    "bloomFilterColumns": [],
    "bloomFilterConfigs": null,
    "noDictionaryColumns": [],
    "onHeapDictionaryColumns": [],
    "varLengthDictionaryColumns": [],
    "enableDefaultStarTree": false,
    "starTreeIndexConfigs": null,
    "enableDynamicStarTreeCreation": false,
    "segmentPartitionConfig": null,
    "columnMinMaxValueGeneratorMode": null,
    "aggregateMetrics": false,
    "nullHandlingEnabled": false,
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "omx_order20",
      "stream.kafka.broker.list": "172.19.131.55:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "realtime.segment.flush.threshold.rows": "0",
      "realtime.segment.flush.threshold.time": "24h",
      "realtime.segment.flush.segment.size": "100M"
    }
  },
  "metadata": {},
  "ingestionConfig": {
    "filterConfig": null,
    "transformConfigs": [
      { "columnName": "channel", "transformFunction": "jsonPathString(\"order\",'$.channel')" },
      { "columnName": "orderId", "transformFunction": "jsonPathString(\"order\",'$.orderId')" },
      { "columnName": "submissionDate", "transformFunction": "FromDateTime('$.submissionDate','YYYY-MM-dd''T''HH:mm:ss.SSSZ')" }
    ]
  },
  "quota": { "storage": null, "maxQueriesPerSecond": null },
  "task": null,
  "routing": { "segmentPrunerTypes": null, "instanceSelectorType": null },
  "query": { "timeoutMs": null },
  "fieldConfigList": null,
  "upsertConfig": null,
  "tierConfigs": null
}
```  
**@ryantle1028:** I use `"columnName": "submissionDate", "transformFunction":
"FromDateTime('$.submissionDate','YYYY-MM-dd''T''HH:mm:ss.SSSZ')"`  
**@ryantle1028:** My JSON looks like this --> `{"order" : { "channel" : "ABC",
"orderId" : "22031500DRS020020017"}, "submissionDate" :
"2022-03-15T17:31:44.540+0700"}`  
**@ryantle1028:** Any recommendation for datetime format conversion? @mark.needham  
**@mark.needham:** You can't use the JSON path syntax inside `FromDateTime` -
it doesn't know what it means  
**@ryantle1028:** oh  
**@mark.needham:** also I think you can only run one function at a time  
**@mark.needham:** so you'll need to do it in two steps  
**@ryantle1028:** I just want to create a table that consumes from Kafka. I
tested with an EPOCH value in the JSON and didn't need to do anything. But for
this --> "submissionDate" : "2022-03-15T17:31:44.540+0700" we need to convert?  
**@mark.needham:** epoch you don't need any time conversion  
**@mark.needham:** it'll handle that automatically  
**@mark.needham:** it's only cause of it being a date string that you have to
do some conversion  
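A minimal sketch of the epoch case, assuming the JSON carries milliseconds since epoch in the submissionDate field, in which case no transform function is needed:

```
"dateTimeFieldSpecs": [
  { "name": "submissionDate", "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS" }
]
```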
**@ryantle1028:** how can I do it, bro?  
**@mark.needham:** ```{ "columnName": "submissionTs", "transformFunction":
"FromDateTime(submissionDate,'YYYY-MM-dd''T''HH:mm:ss.SSSZ')" }```  
**@mark.needham:** rename the column to something else  
**@mark.needham:** and then like this should work  
**@ryantle1028:** ok .modify column name another name .  
**@mark.needham:** yeh - if you use a transform fn it can't transform a value
to the same name  
**@mark.needham:** it's a bit annoying, but that's how it is!  
**@ryantle1028:** oh, it works.  
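Putting the thread together, the working shape of the ingestion config is roughly this (a sketch based on the messages above; submissionTs is the renamed destination column Mark suggested, and submissionDate is read directly from the top level of the message):

```
"ingestionConfig": {
  "transformConfigs": [
    { "columnName": "channel", "transformFunction": "jsonPathString(\"order\",'$.channel')" },
    { "columnName": "orderId", "transformFunction": "jsonPathString(\"order\",'$.orderId')" },
    { "columnName": "submissionTs", "transformFunction": "FromDateTime(submissionDate,'YYYY-MM-dd''T''HH:mm:ss.SSSZ')" }
  ]
}
```

The schema then needs a matching submissionTs dateTimeFieldSpec in epoch millis, and the table's timeColumnName should point at submissionTs rather than submissionDate.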
**@ryantle1028:** I have a question about JSON messages  
**@ryantle1028:** If my JSON has many fields, like this: `{"order": {"channel":
"SFF","orderId": "22031500DRS020020016"},"customer": "570809","omxtrackingID":
"99-d173c048-8bf2-4261-a440-36d1045c63e2","submissionDate":
"2022-03-15T17:31:44.540+0700"}`  
**@ryantle1028:** but I only need the columns channel, orderId and
submissionDate. Can I specify and select just the fields I want to use, or do I
need to create a column for every field in that JSON? @mark.needham  
**@ryantle1028:** You are an expert in Pinot, bro. Thank you so much for your
answers, they help me a lot. @mark.needham. I'm just starting a PoC of Pinot and
plan to use it with Trino for a realtime data platform.  
**@mark.needham:** you only need to create the columns that you need - don't
need to create columns for every JSON property  
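So a schema covering only the fields used in this thread could be as small as the sketch below (submissionTs assumes the renamed time column from earlier; the other names come from the schema posted above):

```
{
  "schemaName": "omx_order_20",
  "dimensionFieldSpecs": [
    { "name": "channel", "dataType": "STRING" },
    { "name": "orderId", "dataType": "STRING" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "submissionTs", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }
  ]
}
```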
**@ryantle1028:** Thank you bro, you help me so much. Tomorrow I will select
some columns to test the PoC with Trino.  
**@ryantle1028:** Do you know how to see more logs? I start Pinot via systemctl
and the log there doesn't help much. I want to see more logs from Pinot. I added
this to the systemd unit --> Environment="JAVA_OPTS=-Xms6G -Xmx8G
-Dlog4j2.logLevel=DEBUG" but the logs are still sparse...  
**@mark.needham:** logs are under `logs/pinot-all.log` on each component  
**@mark.needham:** will be much more in there  
**@ryantle1028:** thank you bro.  
 **@asti.manuka:** @asti.manuka has joined the channel  
 **@valentin:** Hello, I was wondering if it’s possible to partition segments
based on a field value (but without any transformation). For example, I store
in Pinot events from multiple websites; those events have a name (i.e.
`purchase`, `page_view`…) and I would like to create a segment per event name
(with a size limit ofc). Since those events are user defined I can’t really
know how many partitions I’ll have. I’ve seen the Murmur, Hashcode… partition
configs, but they don’t ensure that each event type will have a dedicated
segment (e.g. I don’t want `page_view` and `purchase` events to be in the same
segments, to avoid loading any `page_view` data when doing a query on
`purchase` ones). Thank you  
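For reference, the hash-based partitioning valentin mentions is configured roughly like this (a sketch; the column name and partition count are illustrative). It maps many distinct values onto a fixed number of partitions for segment pruning, rather than giving each value its own segment:

```
"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "eventName": { "functionName": "Murmur", "numPartitions": 8 }
    }
  }
},
"routing": { "segmentPrunerTypes": ["partition"] }
```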
**@ken:** We do something similar (partitioning by country), but we have a
Flink workflow building segments, so this is pretty easy - we just generate
different segments that have a year-month and the country name in the segment
name, so that it’s partitioned by both date & country. We have one country
(US) that is significantly larger than the others, so we sub-partition that by
a hash of the fields that we use for various aggregations.  
**@valentin:** Okay, so you aren’t using the native Pinot partitioning, right?  
**@richard892:** I recall there was recently a community contribution to do
just this, YMMV  
**@richard892:**  
**@valentin:** Oh, great! Thank you!  
 **@golthiryus:** @golthiryus has joined the channel  
 **@krishna.gourishetti:** @krishna.gourishetti has joined the channel  

###  _#pinot-dev_

  
 **@ashish:** Thanks will try it out  

###  _#getting-started_

  
 **@asti.manuka:** @asti.manuka has joined the channel  
 **@golthiryus:** @golthiryus has joined the channel  
 **@krishna.gourishetti:** @krishna.gourishetti has joined the channel  

###  _#jobs_

  
 **@et_tech:** @et_tech has joined the channel  