Apache Pinot Daily Email Digest (2021-09-30)

### _#general_

 **@gqian3:** Hi team, we are currently evaluating a solution that uses a Pinot
hybrid table to serve a dataset combining offline historical data from S3 and
real-time data from Kafka. Is there any documentation on what a hybrid table
setup does and doesn't support, e.g. regarding ingestion, querying, and
retention? Thanks.  
**@xiangfu0:** In Pinot you can configure the deep store on S3 and create a hybrid
table that ingests data from both a batch data source (S3) and real-time data.  
**@gqian3:** Thanks. So far we have only used offline tables. Other than the table
configuration, are there any known functional differences, limitations, or
constraints when using a hybrid table compared to an offline table, in terms of
querying, retention, and ingestion?  
**@xiangfu0:** A real-time table has a different retention than an offline table.
Ingestion-wise, the real-time part comes from Kafka. Each query is split into two
queries (one per table) based on the time boundary. Please check the docs for details.  
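
_Editor's sketch:_ the time-boundary split described above can be pictured roughly as follows. This is a minimal illustration, assuming the broker sends rows at or before the boundary to the offline table and rows after it to the real-time table. The helper `split_on_time_boundary`, the table name `events`, and the column `daysSinceEpoch` are hypothetical and not part of any Pinot API.

```python
from dataclasses import dataclass


@dataclass
class HybridQueries:
    offline: str
    realtime: str


def split_on_time_boundary(table: str, select: str, time_column: str, boundary: int) -> HybridQueries:
    """Rewrite one logical query into an offline and a real-time query."""
    # Offline segments answer everything up to and including the boundary.
    offline = f"SELECT {select} FROM {table}_OFFLINE WHERE {time_column} <= {boundary}"
    # Real-time segments answer everything strictly after it, so no row is counted twice.
    realtime = f"SELECT {select} FROM {table}_REALTIME WHERE {time_column} > {boundary}"
    return HybridQueries(offline, realtime)


# Hypothetical table, column, and boundary value, for illustration only.
queries = split_on_time_boundary("events", "COUNT(*)", "daysSinceEpoch", 18900)
print(queries.offline)
print(queries.realtime)
```

The two partial results are then merged, which is why overlapping data in the offline and real-time tables does not get double-counted.
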
**@dunithd:** I know the Lambda architecture is old-school. But is it correct
to say that Pinot fits into the ‘serving layer’ there?  
**@mayanks:** It is the unified serving + speed layer?  
 **@dadelcas:** Is there a way to configure the desired segment size for the
segment creation job? I've got some small Avro files and the job seems to
create one segment per file. Is this how it works? I'd like to squash these
small files into one bigger segment. Do I need to pre-process them myself
before running the job?  
**@mayanks:** There is a meetup today on the segment merging and roll up which
might help in your case  
 **@karinwolok1:** Join LinkedIn engineering team members in 15 minutes for
 **@karinwolok1:** In case you missed it :wine_glass: Presentation by @snlee
(Senior Software Engineer @ LinkedIn and Apache Pinot PMC) @jiatao (Software
### _#random_

###  _#troubleshooting_

 **@nadeemsadim:** @mayanks @xiangfu0 @jackie.jxt pinot-server RAM usage keeps
increasing over time now that we no longer pass garbage collection params in
jvmOpts in the Helm pinot/values.yaml. Before, we were using jvmOpts like
*`jvmOpts: "-Xms256M -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime -Xloggc:/opt/pinot/gc-pinot-controller.log"`*,
but after migrating to JDK 11 with these jvmOpts the pods started crashing, so we
had to remove them and now use only
*`jvmOpts: "-Xms2M -Xmx8G -Xloggc:/opt/pinot/gc-pinot-server"`*. Without the
garbage collection params, pinot-server RAM usage grows by more than half a GB
every day until RAM is exhausted. What jvmOpts should we provide in Helm for
JDK 11 so that the pods don't crash, GC runs properly, and heap is freed? Also,
for a server with 16 GB of RAM, what should the Xmx value be for the server pod?  
**@xiangfu0:** you can still use *`-XX:+UseG1GC`*  
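
_Editor's sketch:_ one plausible reason for the crash, not confirmed anywhere in this thread, is that several JDK 8 GC-logging flags were removed when unified logging arrived in JDK 9, so a JDK 11 JVM refuses to start when it sees them. The hypothetical helper below (`to_jdk11_opts` is an illustrative name, not a Pinot or JDK tool) drops those flags and rewrites `-Xloggc:` into the `-Xlog:gc*:file=` form. Treat the flag lists as assumptions to verify against your JDK.

```python
# JDK 8 GC-logging flags assumed to be rejected by JDK 9+ (unified logging).
REMOVED_IN_JDK9 = {
    "-XX:+PrintGCDateStamps",
    "-XX:+PrintGCApplicationStoppedTime",
    "-XX:+PrintGCApplicationConcurrentTime",
}
# Still accepted but deprecated; redundant once -Xlog:gc* is in place.
SUPERSEDED = {"-XX:+PrintGCDetails"}


def to_jdk11_opts(jvm_opts: str) -> str:
    """Translate a JDK 8 style option string into one a JDK 11 JVM should accept."""
    out = []
    for flag in jvm_opts.split():
        if flag in REMOVED_IN_JDK9 or flag in SUPERSEDED:
            continue  # drop flags that unified logging replaces
        if flag.startswith("-Xloggc:"):
            # Unified logging equivalent of the old GC log file option.
            flag = "-Xlog:gc*:file=" + flag[len("-Xloggc:"):]
        out.append(flag)
    return " ".join(out)


# The option string quoted in the message above.
JDK8_OPTS = (
    "-Xms256M -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 "
    "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime "
    "-XX:+PrintGCApplicationConcurrentTime -Xloggc:/opt/pinot/gc-pinot-controller.log"
)
print(to_jdk11_opts(JDK8_OPTS))
```

The G1 settings (`-XX:+UseG1GC`, `-XX:MaxGCPauseMillis=200`) pass through unchanged, consistent with the reply above. The sketch does not add a replacement for the safepoint-time flags it drops.
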
**@nadeemsadim:** What should the pinot-server Xmx value be in jvmOpts in Helm
if each pinot-server pod is given 16 GB of RAM?  
**@richard892:** @nadeemsadim my guess is you have a high cardinality inverted
index (do you?) and that means you have a lot of `SoftReference`s, which would
be cleared more aggressively by G1 on JDK8 than JDK11. If that's it, the issue
should be fixed here:  
**@richard892:** The best way to figure this out is to look at a *live* heap
dump, or use JFR `OldObjectSample` (do *NOT* do this in production, it has
very high overhead) - see here  
**@nadeemsadim:** Yes, I do have inverted indexes on many columns, some of which
have high cardinality.  
**@nadeemsadim:** I see the PR was merged 14 hours ago. When can we expect this
to be released? Or has CI/CD already made it release-ready, so that with the
image pull policy set to Always a `helm upgrade` of our Pinot installation in
the k8s cluster will pull the latest build? @xiangfu0 @jackie.jxt @mayanks @richard892  
**@richard892:** Would you be able to confirm the suspected cause with a live
heap dump (`jmap -dump:live,file=dump.bin <pid>`)? If my guess is correct,
there should be a lot of `ImmutableRoaringBitmap` by retained size.  
**@richard892:** Do not share the heap dump, because the strings will
contain sensitive data. A screenshot of the top retained size by type,
from either MAT or the JVisualVM heap dump viewer, would confirm the guess.  
**@nadeemsadim:** ok let me check  
 **@trustokoroego:** Hi, I get below error when starting a pinot broker, any
idea what could be causing it. The key thing I want to achieve is to set the
broker to use hostname instead of the IP which changes on restart:
```
Executing command: StartBroker -zkAddress pinot-zookeeper:2181 -configFileName /tmp/config/broker.conf
Caught exception while starting broker, exiting
java.lang.NullPointerException: null
at java.util.HashMap.putMapEntries( ~[?:?]
at java.util.HashMap.putAll( ~[?:?]
at dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
at dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
at dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
at
```
 **@trustokoroego:** Config setting:
```
# Pinot Cluster name
# Use hostname as Pinot Instance ID other than IP
# Pinot Broker Query Port
# Pinot Routing table builder class
```
 **@bajpai.arpita746462:** Hi all, I am trying to enable upsert mode in a
REALTIME table config in Pinot 0.8.0, and the table is not able to read the
records sent to the Kafka topic. No results are displayed in the Pinot UI at
all; it shows 0 records. Below is the config I added for upsert: `"routing": {
"instanceSelectorType": "strictReplicaGroup" }, "upsertConfig": { "mode":
"FULL" }`. I could not find anything significant in the controller logs
either. When I remove the upsert config and try again, the real-time table is
able to read the records and they are displayed in the Pinot UI. Any idea why
this is happening?  
**@dadelcas:** Just to confirm, have you defined a primary key in your schema?  
**@bajpai.arpita746462:** yes  
**@dadelcas:** It may help if you post both your table config and schema  
**@bajpai.arpita746462:** We suspect a problem on the Kafka side and are trying
to create the topic with proper partitioning. Below is the schema: { "schemaName":
"wxcanalytics", "primaryKeyColumns": ["orgId","reportId"],
"dimensionFieldSpecs": [ { "name": "reportId", "dataType": "STRING" }, {
"name": "orgId", "dataType": "STRING" }, { "name": "firstName", "dataType":
"STRING" }, { "name": "lastName", "dataType": "LONG" } ],
"dateTimeFieldSpecs": [ { "name": "pdate", "dataType": "STRING", "format":
"1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" } ] } Table
config: { "tableName": "wxcanalytics_REALTIME", "tableType": "REALTIME",
"segmentsConfig": { "timeType": "DAYS", "schemaName": "wxcanalytics",
"retentionTimeUnit": "DAYS", "retentionTimeValue": "7", "timeColumnName":
"pdate", "replicasPerPartition": "1" }, "tenants": { "broker":
"DefaultTenant", "server": "DefaultTenant" }, "tableIndexConfig": {
"streamConfigs": { "streamType": "kafka", "stream.kafka.consumer.type":
"lowlevel", "": "bc_data",
"": "xxxxxx:xxxxx",
"realtime.segment.flush.threshold.rows": "1000000",
"realtime.segment.flush.threshold.time": "1h",
"": "smallest" },
"enableDynamicStarTreeCreation": false, "aggregateMetrics": false,
"nullHandlingEnabled": false, "autoGeneratedInvertedIndex": false,
"createInvertedIndexDuringSegmentGeneration": false, "loadMode": "MMAP",
"enableDefaultStarTree": false }, "metadata": { "customConfigs": {} },
"routing": { "instanceSelectorType": "strictReplicaGroup" }, "upsertConfig": {
"mode": "FULL" }, "isDimTable": false }  
**@gabuglc:** can u try moving the primaryKeyColumns after dateTimeFieldSpecs?  
**@gabuglc:** "upsertConfig": { "mode": "FULL", "hashFunction": "NONE" },  
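
_Editor's sketch:_ the upsert prerequisites touched on in this thread (a primary key in the schema, `strictReplicaGroup` routing, a low-level Kafka consumer, and an `upsertConfig` mode) can be sanity-checked before creating the table. The helper below is a hypothetical convenience, not a Pinot API, and it only inspects the two JSON documents. It cannot tell whether the Kafka topic is partitioned by primary key, which is what the poster suspects.

```python
from typing import List


def upsert_config_problems(schema: dict, table: dict) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if not schema.get("primaryKeyColumns"):
        problems.append("schema defines no primaryKeyColumns")
    if table.get("tableType") != "REALTIME":
        problems.append("upsert is only available on REALTIME tables")
    if not table.get("upsertConfig", {}).get("mode"):
        problems.append("upsertConfig.mode is not set")
    if table.get("routing", {}).get("instanceSelectorType") != "strictReplicaGroup":
        problems.append("routing.instanceSelectorType is not strictReplicaGroup")
    stream = table.get("tableIndexConfig", {}).get("streamConfigs", {})
    if stream.get("stream.kafka.consumer.type") != "lowlevel":
        problems.append("stream.kafka.consumer.type is not lowlevel")
    return problems


# Trimmed-down versions of the schema and table config posted above.
schema = {"schemaName": "wxcanalytics", "primaryKeyColumns": ["orgId", "reportId"]}
table = {
    "tableType": "REALTIME",
    "tableIndexConfig": {"streamConfigs": {"stream.kafka.consumer.type": "lowlevel"}},
    "routing": {"instanceSelectorType": "strictReplicaGroup"},
    "upsertConfig": {"mode": "FULL"},
}
print(upsert_config_problems(schema, table))  # prints []
```

The posted config passes all of these checks, so the cause of the empty table is more likely to lie outside the fields checked here, for example in how the topic is partitioned.
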
