### _#general_

 **@neer.shay:** Hi! I'm trying to get a little more information on ThirdEye
and how it stacks up compared to Sherlock/Druid so I have a few questions: 1\.
What is going on behind the scenes? Is there some sort of model running which
trains on historical data and learns what an anomaly is? 2\. How configurable
is this? Can I specify which dimensions/multi-dimensions to run on or does it
automatically run on everything? 3\. How often does it run? Is this
configurable? 4\. Where does it store its metadata? Thanks in advance for the
**@g.kishore:** there is another slack channel for ThirdEye but I am happy to
answer here. Is it ok to create a channel?  
**@neer.shay:** Yes, of course  
**@g.kishore:** 1\. Yes, the model can be trained, and all feedback is provided
as input to the model to update its parameters 2\. You can configure the
dimensions to explore. By default it only monitors the top-level metric 3\.
Yes, it’s configurable. Default is daily 4\. MySQL, but any SQL data store
should work  
**@neer.shay:** Thank you very much @g.kishore  
 **@pabraham.usa:** Is there any doc detailing comparison of Pinot with
ElasticSearch somewhere?  
**@fx19880617:** @yupeng might have some insight  
**@pabraham.usa:** Could you please highlight the major differences and
**@pabraham.usa:** or point me to any doc?  
**@ken:** Hi Matt - I should probably write up something, since we spent
several months trying to get ES to work for our situation, and eventually
bailed in favor of Pinot.  
**@yupeng:** At a high level, ElasticSearch is good for search, such as ranking
or text search. Pinot is more storage-efficient and more performant for
analytical use cases. At Uber, we are migrating the real-time analytical use
cases from ElasticSearch to Pinot  
**@yupeng:** yeah, similarly, we have not published things yet, but will some
time next year  
**@pabraham.usa:** @ken Thanks that would be very useful for people coming
from ES world  
**@mayanks:** @ken that is great to know. Would be awesome if you can blog
about it.  
**@pabraham.usa:** @yupeng Pinot also has a Text Index, which I tried as a test
and it seems to perform well. Still have to compare with ES though.  
**@yupeng:** @pabraham.usa when I said text search, I meant returning the
document snapshot in the search context. ElasticSearch uses a document
model, whereas Pinot builds the index on a text column, not on a document.  
**@darshants.darshan1:** @g.kishore talks about inverted index @11:01  
**@darshants.darshan1:** The video helped me!  

###  _#random_

###  _#feat-text-search_

 **@pabraham.usa:** @pabraham.usa has joined the channel  

###  _#troubleshooting_

 **@elon.azoulay:** Running Java 11 with Pinot works, but I get a lot of
classloader exceptions on startup:
```
java.lang.IllegalArgumentException: object is not an instance of declaring class
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at ~[?:?]
at ~[?:?]
at java.lang.reflect.Method.invoke( ~[?:?]
at dependencies.jar:0.5.0-7efd67a228a2f40139c768d8a55081e5c9ab1ef5]
at dependencies.jar:0.5.0-7efd67a228a2f40139c768d8a55081e5c9ab1ef5]
at dependencies.jar:0.5.0-7efd67a228a2f40139c768d8a55081e5c9ab1ef5]
at dependencies.jar:0.5.0-7efd67a228a2f40139c768d8a55081e5c9ab1ef5]
at dependencies.jar:0.5.0-7efd67a228a2f40139c768d8a55081e5c9ab1ef5]
at dependencies.jar:0.5.0-7efd67a228a2f40139c768d8a55081e5c9ab1ef5]
at dependencies.jar:0.5.0-7efd67a228a2f40139c768d8a55081e5c9ab1ef5]
at
```
 **@elon.azoulay:** And then the plugin that threw the error loads anyway.  
 **@elon.azoulay:** This is for pinot-0.5.0  
 **@fx19880617:** I think Java 11 doesn't support the plugin loading
mechanism; for that we need to add all the plugins to the classpath  
 **@fx19880617:** which means you don't need to specify PLUGINS_DIR in
**@elon.azoulay:** Somehow it definitely is working. Maybe that's why?
**@elon.azoulay:** @fx19880617, thanks - I checked and plugins.dir is set to
/opt/pinot/plugins and classpath is set to /opt/pinot/lib/* but it seems to be
fine. I see, for example that it's using gcsfs (creating files on gcs) and
that it is ingesting using confluent avro plugin, also did `jmap -histo ...`
and I see that the plugin classes are loaded. So can that error be ignored
**@fx19880617:** ic  
**@fx19880617:** then it could be something else  
**@fx19880617:** I observed similar logs for jetty server , seems to be
**@elon.azoulay:** thanks!  
**@elon.azoulay:** @fx19880617 I think there may have been an issue actually,
was able to fix it, will create pr, lmk what you think.  
**@elon.azoulay:** When you have a chance:) I will add you as reviewer  
**@fx19880617:** sure  
**@fx19880617:** please  
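
The classpath suggestion earlier in this thread could be sketched roughly as below. This is only an illustration, not the fix that went into the PR: `build_classpath` is a hypothetical helper, and the two paths are the ones quoted in the thread (`/opt/pinot/lib/*` and `/opt/pinot/plugins`). It assumes plugin jars sit somewhere under the plugins directory.

```python
import os
from pathlib import Path


def build_classpath(lib_entry: str, plugins_dir: str) -> str:
    """Return lib_entry followed by every .jar found under plugins_dir.

    Hypothetical helper: it only builds the string, it does not start Pinot.
    """
    root = Path(plugins_dir)
    # A missing plugins directory simply contributes no extra entries.
    jars = sorted(str(p) for p in root.rglob("*.jar")) if root.is_dir() else []
    return os.pathsep.join([lib_entry, *jars])


if __name__ == "__main__":
    # Paths as quoted in the thread; adjust for your own install.
    print(build_classpath("/opt/pinot/lib/*", "/opt/pinot/plugins"))
```

The resulting string would be passed to the JVM via `-cp`; whether that removes the startup exceptions was not confirmed in the thread.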
 **@nguyenhoanglam1990:** hi team  
 **@nguyenhoanglam1990:** The realtime table uses too much old-generation JVM
memory, and it is not being cleaned up. Why can't GC clear this memory?  
 **@nguyenhoanglam1990:** I consume 1 realtime table with 450 million rows. In
the JVM stats I see that the old generation is using too much memory, but GC
does not clean it up and this large amount of memory stays allocated, so the
system stops responding to many queries  
 **@nguyenhoanglam1990:** pinot.server.netty.port=8000
pinot.server.query.executor.timeout=90000 .hostname=true { "REALTIME": {
"tableName": "bhx_bhx_forecast_forecast_item_REALTIME", "tableType":
"REALTIME", "segmentsConfig": { "timeType": "MILLISECONDS",
"retentionTimeUnit": "DAYS", "retentionTimeValue": "9125",
"segmentPushFrequency": "DAILY", "segmentPushType": "APPEND", "replication":
"4", "replicasPerPartition": "4", "timeColumnName": "_TIMESTAMP",
"schemaName": "bhx_bhx_forecast_forecast_item" }, "tenants": { "broker":
"DefaultTenant", "server": "DefaultTenant", "tagOverrideConfig": {} },
"tableIndexConfig": { "streamConfigs": { "streamType": "kafka",
"stream.kafka.consumer.type": "lowlevel", "":
"PINOT.BHX.bhx_forecast.forecast_item", "stream.kafka.table.tablename":
"bhx_forecast.forecast_item", "stream.kafka.table.part.pattern": "_[0-9]+",
"stream.kafka.cdc.format": "CDC", "":
"com.mwg.pinot.realtime.KafkaCDCConsumerFactory", "":
"": "smallest",
"realtime.segment.flush.threshold.rows": "0",
"realtime.segment.flush.threshold.time": "10m",
"realtime.segment.flush.threshold.segment.size": "5M", "":
"bhx_bhx_forecast.forecast_item-PINOT_INGESTION", "max.partition.fetch.bytes":
"167772160", "receive.buffer.bytes": "67108864", "isolation.level":
"read_committed", "max.poll.records": "5000" }, "enableDefaultStarTree":
false, "enableDynamicStarTreeCreation": false, "aggregateMetrics": false,
"nullHandlingEnabled": false, "autoGeneratedInvertedIndex": false,
"createInvertedIndexDuringSegmentGeneration": false, "loadMode": "MMAP" },
"metadata": { "customConfigs": {} }, "routing": { "instanceSelectorType":
"strictReplicaGroup" }, "instanceAssignmentConfigMap": { "CONSUMING": {
"tagPoolConfig": { "tag": "inventory_REALTIME", "poolBased": false,
"numPools": 0 }, "replicaGroupPartitionConfig": { "replicaGroupBased": true,
"numInstances": 0, "numReplicaGroups": 4, "numInstancesPerReplicaGroup": 5,
"numPartitions": 0, "numInstancesPerPartition": 0 } } }, "upsertConfig": {
"mode": "FULL" } } }  
 **@nguyenhoanglam1990:** PINOT_JAVA_OPTS=-Xmx180g -Xms16G
-Dlog4j2.configurationFile=conf/pinot-admin-log4j2.xml -XX:+UseG1GC
-XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=10
-XX:G1MaxNewSizePercent=20 -XX:G1HeapRegionSize=32M -XX:G1ReservePercent=5
-XX:G1HeapWastePercent=2 -XX:G1MixedGCCountTarget=3 -XX:+AlwaysPreTouch
-XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+ParallelRefProcEnabled
-XX:MaxGCPauseMillis=200 -XX:G1MixedGCLiveThresholdPercent=35
-XX:G1RSetUpdatingPauseTimePercent=5 -XX:SurvivorRatio=32
-XX:MaxTenuringThreshold=1 -XX:InitiatingHeapOccupancyPercent=30
-XX:-G1UseAdaptiveIHOP -XX:+UseStringDeduplication -XX:+PerfDisableSharedMem
-XX:ParallelGCThreads=12 -XX:ConcGCThreads=6
 **@nguyenhoanglam1990:** please help me  
**@fx19880617:** @jackie.jxt do we have any specific memory requirements for
upsert case?  
**@fx19880617:** @yupeng ^^  
**@fx19880617:** we tried to turn this on, but it doesn't work as well:
**@yupeng:** not sure if this is related to upsert  
**@yupeng:** only upsert metadata is on heap  
**@yupeng:** the rest is the same as normal segments  
**@yupeng:** it would be helpful to use debug endpoint to display the memory
**@fx19880617:** oh ? what's this endpoint  
**@yupeng:** `MmapDebugResource`  
**@fx19880617:** @nguyenhoanglam1990 can you try this  
**@nguyenhoanglam1990:** How can I run this "MmapDebugResource"  
**@fx19880617:** should be on server admin port  
**@yupeng:** `debug/memory/offheap/table/{tableName}`  
**@nguyenhoanglam1990:** @yupeng  
**@jackie.jxt:** For upsert, there is a concurrent map storing the mapping
from primary key to record location, which is on heap  
**@jackie.jxt:** If the cardinality of the primary key is not extremely high,
it should be fine  
**@jackie.jxt:** @nguyenhoanglam1990 Is this the server rest port?  
**@nguyenhoanglam1990:** yes  
**@nguyenhoanglam1990:** @jackie.jxt help me  
**@jackie.jxt:** This seems like the controller rest port  
**@yupeng:** that shows your controller  
**@yupeng:** run it on server  
**@nguyenhoanglam1990:** Can't show @yupeng  
**@yupeng:** 8000 is netty  
**@yupeng:** not rest  
**@jackie.jxt:** 8030 is the rest port per the config  
**@nguyenhoanglam1990:** not found @jackie.jxt  
**@jackie.jxt:** Is the host a pinot server?  
**@nguyenhoanglam1990:** yes  
**@nguyenhoanglam1990:** debug/memory/offheap/table/{tableName} path is corr
**@nguyenhoanglam1990:** @jackie.jxt  
**@jackie.jxt:** The path is correct  
**@jackie.jxt:** Can you also try `/debug/memory/offheap`?  
**@jackie.jxt:** Can you also share the schema of the table?  
**@nguyenhoanglam1990:** { "schemaName": "bhx_bhx_forecast_forecast_item",
"dimensionFieldSpecs": [ { "name": "forecastpurchase", "dataType": "DOUBLE" },
{ "name": "createduser", "dataType": "STRING" }, { "name": "inputquantity",
"dataType": "DOUBLE" }, { "name": "forecast", "dataType": "DOUBLE" }, {
"name": "forecastnopromotion", "dataType": "DOUBLE" }, { "name": "storeid",
"dataType": "LONG" }, { "name": "storequantity", "dataType": "DOUBLE" }, {
"name": "isdeleted", "dataType": "INT" }, { "name": "deleteduser", "dataType":
"STRING" }, { "name": "date_key", "dataType": "LONG" }, { "name": "itemid",
"dataType": "STRING" }, { "name": "sellquantity", "dataType": "DOUBLE" }, {
"name": "forecast15", "dataType": "DOUBLE" }, { "name": "forecast8",
"dataType": "DOUBLE" }, { "name": "forecastnopromo", "dataType": "DOUBLE" }, {
"name": "branchquantity", "dataType": "DOUBLE" }, { "name": "updateduser",
"dataType": "STRING" }, { "name": "_DELETED", "dataType": "INT" } ],
"dateTimeFieldSpecs": [ { "name": "createddate", "dataType": "LONG", "format":
"1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }, { "name":
"deleteddate", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH",
"granularity": "1:MILLISECONDS" }, { "name": "updateddate", "dataType":
"LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }, {
"name": "_TIMESTAMP", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH",
"granularity": "1:MILLISECONDS" } ], "primaryKeyColumns": [ "itemid",
"storeid", "date_key" ] }  
**@nguyenhoanglam1990:** { "REALTIME": { "tableName":
"bhx_bhx_forecast_forecast_item_REALTIME", "tableType": "REALTIME",
"segmentsConfig": { "timeType": "MILLISECONDS", "retentionTimeUnit": "DAYS",
"retentionTimeValue": "9125", "segmentPushFrequency": "DAILY",
"segmentPushType": "APPEND", "replication": "4", "replicasPerPartition": "4",
"timeColumnName": "_TIMESTAMP", "schemaName": "bhx_bhx_forecast_forecast_item"
}, "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant",
"tagOverrideConfig": {} }, "tableIndexConfig": { "streamConfigs": {
"streamType": "kafka", "stream.kafka.consumer.type": "lowlevel",
"": "PINOT.BHX.bhx_forecast.forecast_item",
"stream.kafka.table.tablename": "bhx_forecast.forecast_item",
"stream.kafka.table.part.pattern": "_[0-9]+", "stream.kafka.cdc.format":
"CDC", "":
"com.mwg.pinot.realtime.KafkaCDCConsumerFactory", "":
"": "smallest",
"realtime.segment.flush.threshold.rows": "0",
"realtime.segment.flush.threshold.time": "60m",
"realtime.segment.flush.threshold.segment.size": "500M", "":
"bhx_bhx_forecast.forecast_item-PINOT_INGESTION", "max.partition.fetch.bytes":
"167772160", "receive.buffer.bytes": "67108864", "isolation.level":
"read_committed", "max.poll.records": "5000" }, "noDictionaryColumns": [],
"onHeapDictionaryColumns": [], "varLengthDictionaryColumns": [],
"enableDefaultStarTree": false, "starTreeIndexConfigs": [],
"enableDynamicStarTreeCreation": false, "aggregateMetrics": false,
"nullHandlingEnabled": false, "autoGeneratedInvertedIndex": false,
"createInvertedIndexDuringSegmentGeneration": false, "sortedColumn": [],
"bloomFilterColumns": [], "loadMode": "MMAP", "rangeIndexColumns": [] },
"metadata": { "customConfigs": {} }, "routing": { "instanceSelectorType":
"strictReplicaGroup" }, "instanceAssignmentConfigMap": { "CONSUMING": {
"tagPoolConfig": { "tag": "inventory_REALTIME", "poolBased": false,
"numPools": 0 }, "replicaGroupPartitionConfig": { "replicaGroupBased": true,
"numInstances": 0, "numReplicaGroups": 4, "numInstancesPerReplicaGroup": 5,
"numPartitions": 0, "numInstancesPerPartition": 0 } } }, "upsertConfig": {
"mode": "FULL" } } }  
**@nguyenhoanglam1990:** @jackie.jxt schema and table config above  
**@jackie.jxt:** I think the issue is that the primary key (`itemid, storeid,
date_key`) is almost always unique, which will make the key map very big  
**@jackie.jxt:** What's the purpose of enabling upsert for this table?  
**@nguyenhoanglam1990:** This table of data is recalculated daily  
**@jackie.jxt:** I don't follow  
**@jackie.jxt:** Do you need to replace the data every day, or just append the
data for new day?  
**@jackie.jxt:** @yupeng @nguyenhoanglam1990 This rest endpoint
`debug/memory/offheap/table/{tableName}` was added recently and is not
included in the latest release  
**@fx19880617:** I think all the recomputed data are also pushed to kafka  
**@fx19880617:** hence the upsert  
**@yupeng:** oh, i see. then use `memory/offheap` i think  
**@jackie.jxt:** The cardinality of the primary key is unbounded, which will
make the upsert metadata map size unbounded  
**@yupeng:** it’s bounded by the msgs consumed?  
**@jackie.jxt:** ``` "retentionTimeUnit": "DAYS", "retentionTimeValue":
"9125",``` About 25 years data lol  
**@jackie.jxt:** If we want to re-compute the records for the previous day to
fix the data every day, we should use the hybrid table approach, which is
designed for this  
**@nguyenhoanglam1990:** ok tks everybody @fx19880617 @yupeng @jackie.jxt  
**@nguyenhoanglam1990:** got it  
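
A back-of-the-envelope sketch of why the on-heap upsert key map discussed above can dominate the old generation. The 450 million rows and the 9125-day retention come from this thread. The per-entry cost of 100 bytes is purely an assumed figure for illustration, not a measured Pinot number, and the estimate takes the worst case where every row has a distinct primary key.

```python
def estimate_key_map_heap_gib(num_keys: int, bytes_per_entry: int) -> float:
    """Rough heap needed for a primary-key -> record-location map, in GiB."""
    return num_keys * bytes_per_entry / 2**30


def retention_years(retention_days: int) -> float:
    """Convert a retention in days to years (365-day years)."""
    return retention_days / 365


if __name__ == "__main__":
    rows = 450_000_000          # from the thread
    assumed_entry_bytes = 100   # ASSUMPTION: illustrative per-entry overhead
    gib = estimate_key_map_heap_gib(rows, assumed_entry_bytes)
    print(f"~{gib:.1f} GiB of heap if every row has a unique key")
    print(f"retention of 9125 days = {retention_years(9125):.0f} years")
```

Because the map holds live references, GC cannot reclaim it, which would be consistent with the old generation staying full.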
 **@nguyenhoanglam1990:** The odd thing is that the segment is stored on the
heap instead of off-heap, despite configuring = true  
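
To check how much memory is held off-heap, the debug endpoints named in this thread can be queried on the server REST port. A minimal sketch, assuming the server is reachable over plain HTTP: the host name is a placeholder, 8030 is the REST port mentioned above, and the per-table path exists only in builds that include the recently added endpoint. The response format is not described in the thread, so the body is returned as raw text.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen


def offheap_url(host: str, port: int, table_name: Optional[str] = None) -> str:
    """Build the off-heap debug URL, optionally scoped to one table."""
    base = f"http://{host}:{port}/debug/memory/offheap"
    if table_name is None:
        return base
    return f"{base}/table/{quote(table_name, safe='')}"


def fetch(url: str, timeout: float = 10.0) -> str:
    """Return the raw response body; no assumption is made about its format."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # "pinot-server-host" is a placeholder, not a host from the thread.
    print(offheap_url("pinot-server-host", 8030))
    print(offheap_url("pinot-server-host", 8030,
                      "bhx_bhx_forecast_forecast_item_REALTIME"))
```

Port 8000 in the posted config is the netty port, so requests sent there will not reach the REST resource.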

###  _#pinot-dev_

###  _#pinot-0-5-0-release_

###  _#pinot-perf-tuning_

 **@elon.azoulay:** Upgraded to java11, letting it run for a few hours  
 **@elon.azoulay:** Do you recommend we separate workloads into 2 clusters: we
have a realtime only operational workload with very tight sla, and an analytic
workload where ppl are always bulk inserting data, redoing indexes, etc. more
adhoc. Since our cluster is only 6 nodes we thought it would be too small for
multi tenant setup.  
 **@elon.azoulay:** We see a lot of DirectR buffers for the offline analytic
tables which have very large retentions, but the realtime only tables have 1
week retentions and grow relatively slowly.  
 **@steotia:** Sorry just catching up here. Question : is the issue related to
the heap overhead and gc caused by a large number of references for direct
byte buffers? The references cleanup are subject to normal gc cycle even
though the off heap memory pointed to by them is freed?  
 **@g.kishore:** Where are we creating these?  
**@g.kishore:** Can we verify it’s from that and nothing else?  
**@elon.azoulay:** I did heap dumps and they were all coming from there, also
looked through the code - all soft references to array of soft references of
**@elon.azoulay:** But maybe they are created for other use cases we don't
 **@npawar:** @npawar has joined the channel  
 **@elon.azoulay:** Hi, does anyone have a recommendation regarding when it is
beneficial to use separate clusters? We are exploring whether to separate
realtime only, super low latency (queries must return in < 2 seconds),
business critical tables with fixed size/slowly growing segments vs.
hybrid/offline, very fast growing, less critical but higher latency (i.e.
queries can return in ~5 seconds). We only have 6 nodes currently, so until we
really scale up, it makes more sense to have the critical tables in one
cluster and analytic tables in another.  
 **@elon.azoulay:** Does that make sense? i.e. due to analytic tables
constantly changing indexing, exploring data, etc. vs realtime where the
result of a query can trigger an alert (i.e. anomaly detection)  
 **@elon.azoulay:** Or does everyone use 1 cluster?  
**@chinmay.cerebro:** Did you mean separate clusters or separate tenants ? You
could have dedicated tenants for critical use cases  
**@jackie.jxt:** You may have separate tenants inside a single cluster, and
each tenant can be configured differently. The controllers and brokers can be
**@elon.azoulay:** Ah, so we can have "critical realtime", "analytic
realtime", "analytic offline" and isolate the workloads like that?  
**@mayanks:** Yes  
**@elon.azoulay:** Trying it now:)  
**@elon.azoulay:** Also will submit the pr for java11 plugins. Got it working
for us.  
**@elon.azoulay:** You are all the best, thanks for all the help! I hope
contributing back is helpful to all of you.  
**@elon.azoulay:** #1 oss community!  
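
The tenant isolation agreed on above could look roughly like this at the table-config level. It is a sketch only: the `tenants` block with `broker` and `server` keys mirrors the table config posted in #troubleshooting, while the tenant names (`CriticalRealtime`, `AnalyticRealtime`, `AnalyticOffline`) and the helper are hypothetical. Servers would presumably need to be tagged for each tenant before tables are assigned to it.

```python
import json

# Hypothetical tenant names for the three workloads named in the thread.
WORKLOAD_TENANTS = {
    "critical realtime": "CriticalRealtime",
    "analytic realtime": "AnalyticRealtime",
    "analytic offline": "AnalyticOffline",
}


def with_server_tenant(table_config: dict, server_tenant: str,
                       broker_tenant: str = "DefaultTenant") -> dict:
    """Return a copy of table_config whose tenants block targets server_tenant."""
    updated = dict(table_config)  # shallow copy; the caller's dict is untouched
    updated["tenants"] = {"broker": broker_tenant, "server": server_tenant}
    return updated


if __name__ == "__main__":
    base = {"tableName": "alerts_REALTIME", "tableType": "REALTIME"}  # made-up table
    cfg = with_server_tenant(base, WORKLOAD_TENANTS["critical realtime"])
    print(json.dumps(cfg, indent=2))
```

Brokers are left on one shared tenant here only to keep the example small; they could be split per workload in the same way.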
 **@pabraham.usa:** Hello, I deployed Pinot in a k8s cluster and the memory is
reported incorrectly, actually very high. The top command inside the node and
docker stats show the correct usage, ~2GB. However, K8s and prom metrics are
all reporting high usage (25G).  
**@fx19880617:** is it memory mapped size?  
**@fx19880617:** can you give a snapshot of the metric screen  
**@fx19880617:** and the metrics name  
**@pabraham.usa:** Thanks @fx19880617, it is the memory. I can get the actual
metric name  
**@pabraham.usa:** @fx19880617  
**@pabraham.usa:** Top mem usage is 2.7%:  
30302 root 20 0 60.4g 884620 155640 S 0.7 2.7 1:31.34 java  
docker stats:  
949a16209c01 k8s_server_pinot-server-0_log_65bf2d4c-59ce-464e-9533-2121d4e78036_0 0.40% 735.6MiB / 26GiB 2.76% 0B / 0B 0B / 0B 88  
kubectl metrics: “metadata”:{“name”:“pinot-
**@fx19880617:** ic  
**@fx19880617:** i think k8s is reporting the container memory usage  
**@fx19880617:** Pinot memory-maps its data files, and those might count
against the container memory as cache  
**@pabraham.usa:** ohh ok, so the mapped data files, which are on disk, are
also reported as memory usage in the container, right?  
**@pabraham.usa:** If it crosses the limit the container will get killed. If
I raise the mem request too high the pod won't be scheduled, as I would have
to allocate a very high-spec node.  
**@pabraham.usa:** potentially a Java mem reporting issue for k8s rather than
Pinot I guess  
**@fx19880617:** yes  
**@fx19880617:** did you observe the pod getting killed?  
**@fx19880617:** it shouldn’t cross the container limit  
**@fx19880617:** cached data will be cleaned  
**@pabraham.usa:** It almost got killed: 26G is the limit and the mem was at 25G.  
**@pabraham.usa:** ohh ok, it will get cleaned by itself?  
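
One way to check the page-cache explanation above is to split the container's cgroup memory counters into cache and anonymous memory. This is a hedged sketch rather than anything suggested in the thread: it assumes a Linux cgroup `memory.stat` file (fields `cache`/`rss` on cgroup v1, `file`/`anon` on v2), and the sample numbers are invented for illustration.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse 'name value' lines from a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def split_usage(stats: dict) -> dict:
    """Separate page cache (includes mmapped files) from anonymous memory."""
    cache = stats.get("cache", stats.get("file", 0))   # v1 name, then v2 name
    anon = stats.get("rss", stats.get("anon", 0))      # v1 name, then v2 name
    return {"page_cache_bytes": cache, "anonymous_bytes": anon}


if __name__ == "__main__":
    # Invented sample in cgroup v1 layout; read the real file inside the pod.
    sample = "cache 24000000000\nrss 900000000\nmapped_file 23000000000\n"
    print(split_usage(parse_memory_stat(sample)))
```

If most of the reported usage turns out to be page cache, that would match the gap between `top` (resident memory) and the container-level metric.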

###  _#thirdeye-pinot_

 **@g.kishore:** @neer.shay  
 **@neer.shay:** Thanks for creating this channel @g.kishore  
 **@neer.shay:** Hi, has anyone done a comparison between Sherlock/Druid &
ThirdEye/Pinot that they can share their results? Thanks!  

###  _#getting-started_

