Posted to by Pinot Slack Email Digest <> on 2020/12/31 02:00:13 UTC

Apache Pinot Daily Email Digest (2020-12-30)

### _#general_

 **@wrbriggs:** I apologize in advance for my ignorant question, but I’m
struggling conceptually a bit with how to handle dateTime column definitions
in my table schema and segmentsConfig. I have a millisecond-level epoch field
on my incoming realtime data (creatively named `eventTimestamp`). I would like
to maintain this when querying / filtering my records at the individual event
level. However, I would also like to define an hourly derived timestamp to be
used for pre-aggregating with a star tree index. My segments config looks like
this:
```
"segmentsConfig": {
  "timeColumnName": "eventTimestamp",
  "timeType": "MILLISECONDS",
  "retentionTimeUnit": "HOURS",
  "retentionTimeValue": "48",
  "segmentPushType": "APPEND",
  "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
  "schemaName": "mySchema",
  "replication": "1",
  "replicasPerPartition": "1"
},
```
My star tree index looks like this:
```
"starTreeIndexConfigs": [{
  "dimensionsSplitOrder": [
    "dimension1",
    "dimension2"
  ],
  "skipStarNodeCreationForDimensions": [],
  "functionColumnPairs": [
    "SUM__metric1",
    "SUM__metric2",
    "SUM__metric3",
    "DISTINCT_COUNT_HLL__dimension3",
    "DISTINCT_COUNT_HLL__dimension4"
  ],
  "maxLeafRecords": 10000
}],
```
And my dateTimeFieldSpecs:
```
"dateTimeFieldSpecs": [
  {
    "name": "eventTimestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:HOUR",
    "dateTimeType": "PRIMARY"
  }
],
```
Can anyone confirm that this is the correct approach? Should
I be using an ingestion transformation of `toEpochHoursRounded` instead, and
specifying that as a DERIVED dateTimeField in the dateTimeFieldSpecs
configuration, and manually adding that to the dimensionsSplitOrder of my star
tree index?  
**@fx19880617:** @jackie.jxt I think in this case, we need to add a new column
for hour rounded time value then do star tree on it right  
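
To make "hour rounded time value" concrete, here is a minimal sketch in plain Python arithmetic. It is not Pinot code, and the helper names are hypothetical; it only shows the two values such a column could hold for a millisecond epoch like `eventTimestamp`: whole hours since the epoch, or millis truncated to the start of the hour.

```python
MILLIS_PER_HOUR = 60 * 60 * 1000  # 3,600,000 ms in one hour


def to_epoch_hours(epoch_millis: int) -> int:
    """Whole hours elapsed since the Unix epoch (floor division)."""
    return epoch_millis // MILLIS_PER_HOUR


def round_down_to_hour_millis(epoch_millis: int) -> int:
    """Epoch millis truncated to the start of the containing hour."""
    return to_epoch_hours(epoch_millis) * MILLIS_PER_HOUR


if __name__ == "__main__":
    event_timestamp = 1609293600123  # 2020-12-30T02:00:00.123Z
    print(to_epoch_hours(event_timestamp))             # 447026
    print(round_down_to_hour_millis(event_timestamp))  # 1609293600000
```

Either form gives one distinct value per hour, which is what keeps the cardinality of a star-tree time dimension low compared with raw millisecond timestamps.
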
**@wrbriggs:** @fx19880617 Thank you, that makes sense to me, but I was
confused as to why the dateTimeFieldSpec allows me to enter a granularity
different from the incoming format. Also, the current airport examples all use
the deprecated `timeFieldSpec`, which meant I had to go digging in the  and
read the 0.4.0 release notes talking about deprecating `timeFieldSpec` before
I realized I should be using `dateTimeFieldSpecs` instead - I might take a
stab at updating the example + docs once I get this all straight in my head,
to save other people the pain (as long as I’m on the right track, here).  
**@fx19880617:** true, we are updating code base with this pr:  
**@fx19880617:** will update the wiki as well  
**@wrbriggs:** Heh, awesome - I also made the change locally for the `latest`
image for submitting admin commands as jobs :slightly_smiling_face:  
**@fx19880617:** the link you put was outdated wiki  
**@fx19880617:** let me know if docs.pinot helps  
**@fx19880617:** we will update in this site  
**@wrbriggs:** Thanks  
**@wrbriggs:** So it looks like `dateTimeType` (e.g., `PRIMARY`, `SECONDARY`,
or `DERIVED`) is no longer necessary?  
**@fx19880617:** it’s not  
**@fx19880617:** you can define multiple dateTimeFields  
**@fx19880617:** and specify the transform in the table  
**@fx19880617:** you can set `ingestionConfig` in table, e.g.
```
{
  "tableName": "githubEvents",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "segmentPushType": "APPEND",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "schemaName": "githubEvents",
    "replication": "1",
    "timeColumnName": "event_time",
    "timeType": "MILLISECONDS"
  },
  "tenants": {},
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["type", "repo_id"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
          "SUM__pull_request_additions",
          "SUM__pull_request_deletions",
          "SUM__pull_request_changed_files",
          "COUNT__star",
          "DISTINCT_COUNT_HLL__actor_id"
        ],
        "maxLeafRecords": 1000
      }
    ],
    "enableDynamicStarTreeCreation": true,
    "loadMode": "MMAP",
    "invertedIndexColumns": [],
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "repo_id": { "functionName": "Murmur", "numPartitions": 1024 }
      }
    },
    "noDictionaryColumns": []
  },
  "routing": {
    "segmentPrunerTypes": ["partition"]
  },
  "metadata": {
    "customConfigs": {}
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "segmentIngestionFrequency": "DAILY",
      "batchConfigMaps": [],
      "segmentNameSpec": {},
      "pushSpec": {}
    },
    "transformConfigs": [
      {
        "columnName": "event_time",
        "transformFunction": "fromDateTime(created_at, \"yyyy-MM-dd'T'HH:mm:ssZ\")"
      }
    ]
  }
}
```
**@fx19880617:** here i convert the string column `created_at` (format
`yyyy-MM-dd'T'HH:mm:ssZ`) in the raw data to a millis epoch value in `event_time`  
**@fx19880617:** you can specify more time fields and add them into this
transformConfigs, fyi:  
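
Applied to the original question, the same pattern might look like the sketch below, which builds the schema and table-config fragments as Python dicts so they can be dumped to JSON. Several things here are assumptions rather than anything confirmed in the thread: the column name `eventTimestampHours` is hypothetical, the transform `toEpochHours(eventTimestamp)` and the `1:HOURS:EPOCH` format string should be checked against the Pinot docs for your version, and putting the hourly column last in `dimensionsSplitOrder` is just one possible ordering.

```python
import json

# Hypothetical name for the derived column; the thread does not name one.
HOURLY_COLUMN = "eventTimestampHours"


def build_hourly_configs() -> dict:
    """Return schema and table-config fragments for an hourly star-tree dimension."""
    schema_fragment = {
        "dateTimeFieldSpecs": [
            {   # raw event time, kept at millisecond precision for filtering
                "name": "eventTimestamp",
                "dataType": "LONG",
                "format": "1:MILLISECONDS:EPOCH",
                "granularity": "1:MILLISECONDS",
            },
            {   # derived column holding hours since the epoch (assumed format)
                "name": HOURLY_COLUMN,
                "dataType": "LONG",
                "format": "1:HOURS:EPOCH",
                "granularity": "1:HOURS",
            },
        ]
    }
    table_fragment = {
        "tableIndexConfig": {
            "starTreeIndexConfigs": [
                {
                    # hourly column added as a dimension (ordering is a choice)
                    "dimensionsSplitOrder": ["dimension1", "dimension2", HOURLY_COLUMN],
                    "skipStarNodeCreationForDimensions": [],
                    "functionColumnPairs": [
                        "SUM__metric1",
                        "SUM__metric2",
                        "SUM__metric3",
                        "DISTINCT_COUNT_HLL__dimension3",
                        "DISTINCT_COUNT_HLL__dimension4",
                    ],
                    "maxLeafRecords": 10000,
                }
            ]
        },
        "ingestionConfig": {
            "transformConfigs": [
                {
                    "columnName": HOURLY_COLUMN,
                    # Assumption: verify this function name for your Pinot version.
                    "transformFunction": "toEpochHours(eventTimestamp)",
                }
            ]
        },
    }
    return {"schema": schema_fragment, "table": table_fragment}


if __name__ == "__main__":
    print(json.dumps(build_hourly_configs(), indent=2))
```

The fragments would still need to be merged into the full schema and table config; `timeColumnName` can stay on `eventTimestamp` so event-level filtering keeps millisecond precision.
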
**@wrbriggs:** Perfect, thank you. One more stupid question (hopefully last
one for the day)… what should I look for in the trace in order to verify that
my query is using my star tree index? Is there a Pinot equivalent of SQL
**@fx19880617:** typically from the results, you can see numDocsScanned  
**@fx19880617:** which should be way less than the total docs  
**@fx19880617:** e.g.  
**@fx19880617:** @jackie.jxt might provide more insights here  
**@wrbriggs:** Ok. I have inverted indices as well, so I was just trying to
figure out how to ensure it was using the star tree index instead - it is
definitely showing far fewer scanned than total:  
**@wrbriggs:** I just barely started ingestion, so I need to let it build up
some more data :slightly_smiling_face:  
**@fx19880617:** ic  
**@fx19880617:** for consuming segment, i think there is no star-tree built  
**@fx19880617:** it will go to inv index  
**@wrbriggs:** Ah  
**@fx19880617:** once the segment is sealed, star-tree will be built  
**@wrbriggs:** That makes sense  
**@jackie.jxt:** Another way is to enable the tracing for the query and see if
it uses the `StarTreeFilterOperator`  
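
Both checks mentioned so far (comparing `numDocsScanned` with the total, and looking for `StarTreeFilterOperator` in the trace) can be scripted against a broker response that has already been parsed into a dict. This is only a sketch: the `totalDocs` and `traceInfo` keys are assumptions about the response shape, the trace is searched as plain text because its exact structure is not described here, and the sample response is hand-written, not real Pinot output.

```python
import json
from typing import Any, Dict


def scan_ratio(response: Dict[str, Any]) -> float:
    """Fraction of documents scanned; a small value suggests pre-aggregation."""
    total = response.get("totalDocs", 0)
    if total == 0:
        return 0.0
    return response.get("numDocsScanned", 0) / total


def trace_mentions_operator(
    response: Dict[str, Any],
    operator_name: str = "StarTreeFilterOperator",
) -> bool:
    """True if the operator name appears anywhere in the (assumed) trace field."""
    return operator_name in json.dumps(response.get("traceInfo", {}))


if __name__ == "__main__":
    # Hand-written sample for illustration only.
    sample = {
        "numDocsScanned": 120,
        "totalDocs": 48000,
        "traceInfo": {"server_1": "... StarTreeFilterOperator ..."},
    }
    print(f"scan ratio: {scan_ratio(sample):.4f}")
    print("star-tree operator in trace:", trace_mentions_operator(sample))
```

A low scan ratio alone does not separate the star-tree from an inverted index, which is why the trace check is the more direct signal.
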
**@jackie.jxt:** For the date time fields, is this column already rounded to
each hour?
```
"dateTimeFieldSpecs": [
  {
    "name": "eventTimestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:HOUR"
  }
],
```
**@jackie.jxt:** If so, you can directly use it as the star-tree dimension, if
not, then you can create a new rounded time column and use it in the star-tree  

###  _#troubleshooting_

 **@contact:** Hey everyone, i'm trying to set up pinot from the tarball
distribution (so without docker) with an ansible playbook (hopefully will be
able to open source it at some point). However i hit a wall when trying to
load plugins, i'm using java 8 (`openjdk version "1.8.0_275"`) with the
following jvm flags:
```
JVM_OPTS=-Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -Dplugins.include=pinot-pubsub,pinot-s3 -Xloggc:/var/log/pinot-gc-controller.log
```
 **@contact:** The directories are the same as in the docker image:  
 **@contact:** I've a test setup in docker working fine (with the same
controller config), however in bare metal with the tar distrib i'm getting:  
**@contact:**
```
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.pinot.plugin.filesystem.S3PinotFS
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd9b84>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0->
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb6>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at$startBootstrapServices$0( ~[pinot-al>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at [pinot-all-0.6.0-jar-wit>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: [pinot-all-0.6.0-ja>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at [pinot-all-0.6.0-jar-with-dependen>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at [pinot-all-0.6.0-jar-with-dependencies.jar>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646bace>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at [pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafc>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: Caused by: java.lang.ClassNotFoundException: org.apache.pinot.plugin.filesystem.S3PinotFS
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[?:1.8.0_275]
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at java.lang.ClassLoader.loadClass( ~[?:1.8.0_275]
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646bacea>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafc>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafc>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafc>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: at ~[pinot-all-0.6.0-jar-with-dependencies.jar:0.6.0-bb646baceafcd>
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: ... 13 more
```
 **@contact:** For more info, here is part of the init logs, which log the env
config:
```
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: ZkClient monitor key or type is not provided. Skip monitoring.
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Starting ZkClient event thread.
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: Terminate ZkClient event thread.
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: Terminate ZkClient event thread.
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: Closed zkclient
Dec 30 16:34:21 ubuntu2004.localdomain bash[10587]: Initializing PinotFSFactory
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client dependencies.jar
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client environment:java.compiler=<NA>
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client environment:os.arch=amd64
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client environment:os.version=5.4.0-54-generic
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client environment:user.home=/root
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Client environment:user.dir=/usr/local/apache-pinot-incubating-0.6.0-bin
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Initiating client connection, connectString= sessionTimeout=30000 watcher=org.apache.helix.manager.zk.client.ZkConnectionManager@71e9ebae
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Opening socket connection to server Will not attempt to authenticate using SASL (unknown error)
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Socket connection established to, initiating session
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Session establishment complete on server, sessionid = 0x100000024270010, negotiated timeout = 30000
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: zookeeper state changed (SyncConnected)
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: MBean HelixZkClient:Key=10_1_0_11_2181_30000,Type=ZkConnectionManager has been registered.
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: MBean HelixZkClient:Key=10_1_0_11_2181_30000,PATH=Root,Type=ZkConnectionManager has been registered.
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: ZkConnection 10_1_0_11_2181_30000 was created for sharing.
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Sharing ZkConnection 10_1_0_11_2181_30000 to a new SharedZkClient.
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: ZkClient monitor key or type is not provided. Skip monitoring.
Dec 30 16:34:20 ubuntu2004.localdomain bash[10587]: Starting ZkClient event thread.
```
 **@contact:** Does anyone have an idea?  
 **@dlavoie:** Mind running a `ps aux` so we get confirmation of the exact
arguments that were provided to the jvm process?  
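
Once the `ps aux` output is available, the `-D` system properties can be pulled out of the command line and compared with what was intended. The sketch below is one way to do that; the helper name is hypothetical, the sample command line is assembled from the `JVM_OPTS` quoted earlier rather than from real `ps` output, and checking for `plugins.dir` is an assumption about which property is worth inspecting, not a diagnosis.

```python
import shlex
from typing import Dict


def jvm_system_properties(command_line: str) -> Dict[str, str]:
    """Collect -Dkey=value arguments from a JVM command line."""
    props: Dict[str, str] = {}
    for token in shlex.split(command_line):
        if token.startswith("-D"):
            # Split only on the first '=' so values may contain '=' themselves.
            key, _, value = token[2:].partition("=")
            props[key] = value
    return props


if __name__ == "__main__":
    # Assembled from the JVM_OPTS in the thread; not actual `ps aux` output.
    cmdline = (
        "java -Xms1G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 "
        "-XX:+PrintGCDetails -Dplugins.include=pinot-pubsub,pinot-s3 "
        "-Xloggc:/var/log/pinot-gc-controller.log"
    )
    props = jvm_system_properties(cmdline)
    print("plugins.include:", props.get("plugins.include"))
    print("plugins.dir:", props.get("plugins.dir", "<not set>"))
```
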