Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2022/04/13 02:00:29 UTC

Apache Pinot Daily Email Digest (2022-04-12)

### _#general_

  
 **@zliu:** Hi, how can I debug the Pinot project locally with IDEA?  
**@npawar:** did you see this?  
**@zliu:** ok, thanks  
**@zliu:** In the ‘Starting Pinot via IDE’ section, it is not described in
detail. Is there a more detailed introduction?  
**@npawar:** Which part in particular? The only step there is starting the
BatchQuickstart class  
**@zliu:** It would be better to show how to start the “controller”, “broker” and “server” step by step  
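(For reference, the step-by-step alternative to the quickstart is launching each component separately; in the IDE you can run the corresponding command classes with the same arguments. A minimal sketch using the bundled admin script, with illustrative ports and addresses that are not from this thread:)
```
# Sketch only: start each component in steps (ports/hosts are illustrative defaults)
bin/pinot-admin.sh StartZookeeper -zkPort 2181
bin/pinot-admin.sh StartController -zkAddress localhost:2181 -controllerPort 9000
bin/pinot-admin.sh StartBroker -zkAddress localhost:2181
bin/pinot-admin.sh StartServer -zkAddress localhost:2181
```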
 **@ysuo:** Hi there, are there some background tasks Pinot does every 30
mins?  
**@mayanks:** There are several background tasks that happen at configurable
intervals (e.g. retention).  
**@npawar:** @mark.needham Another question related to controller periodic
tasks, fyi  
**@npawar:** Hi @ysuo, added this page explaining all the periodic tasks the
controller does in the background  
**@mayanks:** Thanks @npawar for adding this.  
 **@wcxzjtz:** just wondering if it is possible to set up multiple H3
resolutions for a geometry column? If so, how does it work? Like during a
query, how does Pinot choose which resolution to use? :grinning: thanks.  
**@yupeng:** Yes, it's possible  
**@yupeng:** It takes an array in index config  
**@yupeng:** right now it uses lowest resolution  
**@wcxzjtz:** Gotcha. Thanks.  
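(For context, the H3 geo index is configured per column via `fieldConfigList`; a minimal sketch, where the column name and resolution value are placeholders and the exact format for listing multiple resolutions should be checked against the geospatial docs:)
```
"fieldConfigList": [
  {
    "name": "location_st_point",
    "encodingType": "RAW",
    "indexType": "H3",
    "properties": { "resolutions": "5" }
  }
]
```
Per the thread above, the property accepts multiple resolution values, and queries currently use the lowest one.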
 **@sunhee.bigdata:** Hi, I created a realtime partitioned table.
```"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "subject": { "functionName": "murmur", "numPartitions": 3 }
    }
  }
},``` I then added Kafka topic partitions (3->4) and produced data to the new Kafka
partition, but there is no new segment in Pinot, so it doesn’t show the data in the
new Kafka partition. Even after changing numPartitions (3->4) in the Pinot config and
rebalancing servers, the result is the same. A non-partitioned realtime table seems to
have no problem: after adding a Kafka partition and producing data to the new
partition, a new segment is added in Pinot and it shows the data in the new Kafka
partition. Is this the normal behavior? Otherwise, what should I check? Thanks
:)  
**@mayanks:** I am a bit unclear on the question. Are you observing that going
from no partitions to N partitions in an RT table there is no issue, but when
you go from N partitions to N+1, the new partition doesn’t show up?  
**@mayanks:** cc: @npawar on when does pinot pick up the new partition (is it
a periodic task, or can it be triggered via rebalance)?  
**@npawar:** periodic task, runs hourly.  
**@npawar:** @mark.needham would you please help add all the periodic tasks
and what they do to this page?  We only have the configs listed here so far,
which isn’t the most helpful when someone doesn’t know about the periodic
tasks at all  
**@npawar:** @sunhee.bigdata here’s some more info about the periodic task on
the controller, which adds the new partitions:  
**@npawar:** you can also trigger manually if needed, details in the doc  
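(As a concrete illustration of the manual trigger, the controller exposes a periodic-task endpoint; the host, the table name, and the assumption that RealtimeSegmentValidationManager is the task that picks up new stream partitions should all be verified against the doc linked above:)
```
# List available controller periodic tasks
curl "http://<controller-host>:9000/periodictask/names"
# Run the realtime validation task for one table instead of waiting for the hourly run
curl "http://<controller-host>:9000/periodictask/run?taskname=RealtimeSegmentValidationManager&tableName=myTable_REALTIME"
```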
 **@satyam.raj:** hey guys, I’m trying to do batch-ingestion of ORC files from
S3 to pinot using the spark batch job. ```export PINOT_VERSION=0.10.0
export PINOT_DISTRIBUTION_DIR=/Users/satyam.raj/dataplatform/pinot-dist/apache-pinot-0.10.0-bin

bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[8]" \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-${PINOT_VERSION}-shaded.jar:${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-hdfs/pinot-hdfs-${PINOT_VERSION}-shaded.jar" \
  ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-jar-with-dependencies.jar \
  -jobSpecFile '/Users/satyam.raj/dataplatform/pinot-dist/batchjob-spec/batch-job-spec.yaml'```
Getting the below weird error:
```Exception in thread "main" java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    org/apache/spark/metrics/sink/MetricsServlet.<init>(Ljava/util/Properties;Lcom/codahale/metrics/MetricRegistry;Lorg/apache/spark/SecurityManager;)V @116: invokevirtual
  Reason:
    Type 'com/codahale/metrics/json/MetricsModule' (current frame, stack[2]) is not assignable to 'com/fasterxml/jackson/databind/Module'
  Current Frame:
    bci: @116
    flags: { }
    locals: { 'org/apache/spark/metrics/sink/MetricsServlet', 'java/util/Properties', 'com/codahale/metrics/MetricRegistry', 'org/apache/spark/SecurityManager' }
    stack: { 'org/apache/spark/metrics/sink/MetricsServlet', 'com/fasterxml/jackson/databind/ObjectMapper', 'com/codahale/metrics/json/MetricsModule' }
  Bytecode:
    0000000: 2a2b b500 2a2a 2cb5 002f 2a2d b500 5c2a
    0000010: b700 7e2a 1280 b500 322a 1282 b500 342a
    0000020: 03b5 0037 2a2b 2ab6 0084 b600 8ab5 0039
    0000030: 2ab2 008f 2b2a b600 91b6 008a b600 95bb
    0000040: 0014 592a b700 96b6 009c bb00 1659 2ab7
    0000050: 009d b600 a1b8 00a7 b500 3b2a bb00 7159
    0000060: b700 a8bb 00aa 59b2 00b0 b200 b32a b600
    0000070: b5b7 00b8 b600 bcb5 003e b1

	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
	at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:200)
	at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:196)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
	at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:196)
	at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:104)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:514)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2550)
	at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
	at org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner.run(SparkSegmentGenerationJobRunner.java:196)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:146)
	at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:125)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:121)
	at org.apache.pinot.tools.Command.call(Command.java:33)
	at org.apache.pinot.tools.Command.call(Command.java:29)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
	at picocli.CommandLine.access$1300(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
	at picocli.CommandLine.execute(CommandLine.java:2078)
	at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:153)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:855)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)```  
**@mayanks:** What version of Spark? Also cc: @xiangfu0 for any inputs  
**@mayanks:** @kharekartik ^^  
**@kharekartik:** Hi Satyam, can you also mention the Java version you are
using?  
 **@satyam.raj:** Can someone help figure this out?  
 **@easuncion:** @easuncion has joined the channel  
 **@gfeldman8:** @gfeldman8 has joined the channel  
 **@paul:** @paul has joined the channel  
 **@padma:** Hi, I have time series data flowing in a kafka stream that I am
ingesting into Pinot using realtime ingestion. I have a 48-hour data retention.
Data volume is around 6 TB for 48 hours. We have created an inverted index for
one of the filter attributes. Query performance is about 18 seconds when we
query the data with that filter. However, there is another query parameter, the
timestamp, which is used in range filters and has no index created currently,
as we thought the segments would be created based on the time attribute defined
in the table config. Is that a correct assumption, or do you suggest creating a
sorted index for the timestamp attributes? What kind of hardware would be ideal
for getting <100 ms query times? Query performance runs into seconds initially
and then reduces significantly to around 200 ms. So, holding the data in memory
is key, but we can’t have 6 TB of memory across the 43 pods we have allocated
to Pinot servers. Each Pinot server is configured to have about 28-32 GB, of
which 50% is allocated to the JVM heap and the rest to memory mapping  
**@richard892:** Hi @padma you don't need to sort on time, try to reserve
sorting for your most filtered column since it's generally the best option in
terms of space/time  
**@richard892:** for time you can add a range index  
**@richard892:** also how is your kafka topic partitioned? Have you configured
partitioning so queries can make use of it?  
**@richard892:**  
**@padma:** Currently no, but that’s in the works currently.  
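(For reference, partition-aware query routing generally needs both the partition config and the partition pruner in the table config; a sketch with a placeholder column name and partition count:)
```
"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "yourPartitionColumn": { "functionName": "Murmur", "numPartitions": 4 }
    }
  }
},
"routing": {
  "segmentPrunerTypes": ["partition"]
}
```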
**@padma:** So, do you think we need to add range index for Timestamp even
though we configured the timestamp field as the time attribute in the table
config?  
**@richard892:** Yes I believe you need to add it to the range index columns  
**@richard892:** however, there's a new feature which might be interesting to
you available on latest builds - timestamp index which @xiangfu0 wrote
recently  
**@richard892:** maybe you could experiment with it, otherwise range index
should be good  
**@padma:** We cannot upgrade Pinot atm  
**@padma:** @richard892 what is the significance of this configuration in the
schema?  
**@padma:** ```"dateTimeFieldSpecs": [ { "name": "timestamp_ms", "dataType":
"LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:HOURS" }```  
**@padma:** I thought this is used for dividing up the data into different
segments and hence there won’t be a need for creating a range index for the
datetime field, which in our case is the timestamp field  
**@mayanks:** That config is specifying the time column(s). It is not used to
time partition the segments.  
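(A sketch of what the suggested range index would look like in the table config; `timestamp_ms` is taken from the schema snippet above, and the inverted-index column name is a placeholder:)
```
"tableIndexConfig": {
  "invertedIndexColumns": ["yourFilterColumn"],
  "rangeIndexColumns": ["timestamp_ms"]
}
```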
 **@chris.zhou:** @chris.zhou has joined the channel  

###  _#random_

  
 **@easuncion:** @easuncion has joined the channel  
 **@gfeldman8:** @gfeldman8 has joined the channel  
 **@paul:** @paul has joined the channel  
 **@chris.zhou:** @chris.zhou has joined the channel  

###  _#troubleshooting_

  
 **@sumit.l:** Hi team, we are trying to integrate thirdeye with our pinot
cluster using  and  and we can access the dashboard on port 1426 but unable to
see any pinot data set. What other steps are involved in this integration ?  
**@nair.a:** @mayanks @xiangfu0 can someone help us?  
**@mayanks:** @pyne.suvodeep ^^  
 **@sumit.l:**  
 **@easuncion:** @easuncion has joined the channel  
 **@francois:** Hi, little question about hybrid tables ... I’ve managed to purge
my offline table successfully (big thanks to @mayanks and @jlli ). Segment size
0 docs :slightly_smiling_face: Pretty happy. But ... I keep seeing the purged
records, as they remain in the realtime table :confused: The retention is still a
bit long. Is there a way to clear them, either using segment metadata on the
offline side or by deleting processed realTimeToOffline segments on the realtime
side? Many thanks :wink:  
**@mayanks:** What is your retention in real-time? Also how often do you push
to offline? If daily, then the time boundary should take care of serving one-day-old
data from offline and not real-time  
**@francois:** Retention is not set on the realtime table :confused: So if I
schedule the realtime-to-offline task close to my retention time (let’s say
retention time = realTimeToOffline schedule + 1 hour gap) it should do the
trick. Thx for bringing this to my attention :wink:  
**@mayanks:** Yes  
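(To make the two knobs discussed above concrete, a sketch of the realtime-table pieces involved; the retention and schedule values are illustrative only:)
```
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "3"
},
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "1d"
    }
  }
}
```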
 **@gfeldman8:** @gfeldman8 has joined the channel  
 **@paul:** @paul has joined the channel  
 **@chris.zhou:** @chris.zhou has joined the channel  

###  _#pinot-k8s-operator_

  
 **@bagi.priyank:** @bagi.priyank has joined the channel  

###  _#thirdeye-pinot_

  
 **@nair.a:** @nair.a has joined the channel  
 **@sumit.l:** @sumit.l has joined the channel  
 **@sanjay.a:** @sanjay.a has joined the channel  

###  _#getting-started_

  
 **@easuncion:** @easuncion has joined the channel  
 **@gfeldman8:** @gfeldman8 has joined the channel  
 **@paul:** @paul has joined the channel  
 **@chris.zhou:** @chris.zhou has joined the channel  
 **@bagi.priyank:** Hello, we are planning to query pinot tables via Airflow
DAGs. We found  airflow connector. Is it advised to use the connector or
should we use ? Are there any pros and cons between the pinotdb API and
SQLAlchemy?  
**@mayanks:** I think I remember someone mentioning they are using it. But
that is pretty much all I have on that. Will wait for others to chime in  
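(For what it’s worth, a minimal sketch of querying Pinot from Python with the pinotdb DB-API client; the broker host/port and table name are placeholders, and the SQLAlchemy dialect wraps this same client, so the choice is mostly about whether you want engine/ORM ergonomics:)
```
from pinotdb import connect

# Placeholder broker host/port and table name; adjust for your cluster.
conn = connect(host="broker-host", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()
curs.execute("SELECT COUNT(*) FROM myTable")
for row in curs:
    print(row)
```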