Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/08/21 02:00:18 UTC

Apache Pinot Daily Email Digest (2021-08-20)

### _#general_

  
 **@hasanat.muztaba:** @hasanat.muztaba has joined the channel  
 **@mark.r.watkins:** @mark.r.watkins has joined the channel  
 **@banlinhlinhcon:** @banlinhlinhcon has joined the channel  
 **@diogo.baeder:** @diogo.baeder has joined the channel  

###  _#random_

  
 **@hasanat.muztaba:** @hasanat.muztaba has joined the channel  
 **@mark.r.watkins:** @mark.r.watkins has joined the channel  
 **@banlinhlinhcon:** @banlinhlinhcon has joined the channel  
 **@diogo.baeder:** @diogo.baeder has joined the channel  

###  _#troubleshooting_

  
 **@nadeemsadim:** can we apply inverted indexing or range indexing to
existing old table columns with tens of millions of records? what impact will
it have on table performance, and should I expect query latency to improve
for old data queried as well ..  
**@xiangfu0:** it will improve queries with predicates on the columns you
put indexes on  
**@xiangfu0:** you can apply it on old tables then reload all the segments  
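For reference, adding these indexes to an existing table is just a table config change followed by a segment reload; a minimal sketch of the relevant `tableIndexConfig` section (the column names here are hypothetical, not from this thread):
```
{
  "tableIndexConfig": {
    "invertedIndexColumns": ["tenantId", "status"],
    "rangeIndexColumns": ["latencyMs"]
  }
}
```
After updating the table config, the existing segments need to be reloaded (e.g. via the controller's segment reload operation) so the new indexes are also built for old data.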
**@nadeemsadim:** what impact will it have on heap memory  
**@xiangfu0:** there will be extra disk space overhead  
**@nadeemsadim:** will I need to increase heap memory, since the index will
be added separately as a bitmap in heap memory?  
**@nadeemsadim:** will it be stored in ram or ssd  
**@xiangfu0:** indexes are on ssd; at runtime pinot will load data/indexes  
**@nadeemsadim:** ok, so once processed .. they will be written to ssd, and gc will
eventually clear the heap for hot segments which are being consumed?  
**@xiangfu0:** right  
**@xiangfu0:** it will be off-heap  
**@nadeemsadim:** how difficult is it to apply lucene fst indexing for regexp
queries  
**@nadeemsadim:** I mean I only need to apply the indexing on the string
columns, or is anything else also needed?  
**@xiangfu0:** yes  
**@xiangfu0:** apply index, then it will improve the text search query  
**@xiangfu0:** but you pay extra disk space for index  
**@nadeemsadim:** what is the preferred or recommended indexing for substring
while doing group by and filter .. lucene fst?  
**@xiangfu0:** lucene will help on filter not group by  
**@nadeemsadim:** ok filter will prune a lot of segments / records  
**@nadeemsadim:** that will also improve latency drastically i guess  
**@nadeemsadim:** since we are also doing filter with substring on a string
column  
**@nadeemsadim:** so lucene fst on that string column, where we are doing
substring while filtering, will help?  
**@nadeemsadim:** *apply index, then it will improve the text search query* =>
so regexp or substring query performance won't improve if I use lucene
fst, and it will only improve if I use a text_match clause?  
**@nadeemsadim:** cc: @mayanks @jackie.jxt @ssubrama @g.kishore @npawar @ken
@dlavoie  
**@nadeemsadim:** cc: @mohamedkashifuddin @shaileshjha061 @hussain  
**@xiangfu0:** it will improve the text_match  
**@xiangfu0:** substring or regex won’t get the benefit  
**@nadeemsadim:** ok  
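For context, text and FST indexes are declared per column under `fieldConfigList` in the table config. A rough sketch, assuming a hypothetical string column `logMessage` (check the Pinot docs for the exact options in your version):
```
{
  "fieldConfigList": [
    {
      "name": "logMessage",
      "encodingType": "RAW",
      "indexType": "TEXT"
    }
  ]
}
```
With a text index the column is queried through the `TEXT_MATCH` clause, which matches the advice above; an FST index instead uses `"encodingType": "DICTIONARY"` with `"indexType": "FST"`.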
**@nadeemsadim:** if the time column is the primary time column in the schema and stored
as a long data type in milliseconds epoch with granularity of seconds .. is it
necessary to use range indexing on the time column on my tables if all queries
have a filter (where clause) on time, say last 5 min or last 1 hour
or last 30 days @xiangfu0 .. I mean by default segments do have a start time and
end time if it is the primary time column I GUESS .. and segment pruning will
happen by default, or is range indexing on the time column needed?  
**@xiangfu0:** segment pruning will take care of start/end time if it’s in your
predicate  
**@nadeemsadim:** so range indexing needed ?  
**@xiangfu0:** no  
**@nadeemsadim:** ok got it  
**@xiangfu0:** it will help, but since your segments are ordered by time
already, I feel it’s not necessary to put an index  
**@xiangfu0:** do you have a sorted index?  
**@xiangfu0:** you can sort time  
**@nadeemsadim:** ok  
**@nadeemsadim:** but the time coming on the kafka topic may not be sorted .. I
mean some unordered/unsorted timestamps may come during ingestion  
**@nadeemsadim:** so is a sorted index recommended in such a scenario?  
**@nadeemsadim:** for time column  
**@xiangfu0:** pinot will sort data when it’s persisting the data to disk  
**@xiangfu0:** but you can only pick one sorted index  
**@nadeemsadim:** ok, then I will select the time column as the sorted index column  
**@nadeemsadim:** then range indexing won't be needed on the time column .. right  
**@nadeemsadim:** also, sorted indexing will internally be doing inverted
indexing too  
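A minimal sketch of what picking the time column as the sorted column might look like in the table config (the column name here is hypothetical); for realtime tables Pinot sorts on this column when it persists a consuming segment:
```
{
  "tableIndexConfig": {
    "sortedColumn": ["eventTimeMillis"]
  }
}
```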
**@nadeemsadim:** also one last doubt .. will inverted indexing performance be
the same for string type columns as for long/int columns, or are int/long columns
going to give better performance even after the index is applied .. my use case is
I have tenantId and other ID columns as strings, but they are digits only ..
so is it necessary to keep them as long type in the schema and only then apply
inverted indexing, or can I use the string data type with inverted indexing
and it won't have much impact on query latency while filtering or grouping by ..  
**@xiangfu0:** you can use string, but you will pay more storage cost  
**@nadeemsadim:** latency wise not much impact?  
**@xiangfu0:** if you can use long type to represent id, then suggest to use
long  
**@nadeemsadim:** ok  
**@xiangfu0:** latency wise, using long will definitely help  
**@nadeemsadim:** cool  
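For example, keeping an all-digit ID as a long rather than a string is just a schema-level choice; a sketch of the relevant schema fragment (field name taken from the question above, the rest is assumed):
```
{
  "dimensionFieldSpecs": [
    {
      "name": "tenantId",
      "dataType": "LONG"
    }
  ]
}
```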
**@ssubrama:** @nadeemsadim please use the controller endpoint to recommend
indexes. As for inverted indexes, the only time they are on heap is when a
segment is in consuming state. Once segments are completed (or they are offline
segments) there is nothing stored in heap  
**@nadeemsadim:** Ok @ssubrama got it  
 **@hasanat.muztaba:** @hasanat.muztaba has joined the channel  
 **@mark.r.watkins:** @mark.r.watkins has joined the channel  
 **@banlinhlinhcon:** @banlinhlinhcon has joined the channel  
 **@31cswarszawa:** Hey everyone! I’m working on creating a realtime pinot
table that is based on messages published on a kafka topic. My goal is to see
messages pushed to kafka in the pinot table as fast as possible. Here’s my
current table config:
```
"tableIndexConfig": {
  "loadMode": "MMAP",
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.consumer.type": "simple",
    "stream.kafka.topic.name": "bb8_logs",
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
    "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "stream.kafka.zk.broker.url": "zookeeper:2181/kafka",
    "stream.kafka.broker.list": "kafka:9092",
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "1h",
    "realtime.segment.flush.desired.size": "50M",
    "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
  }
}
```
Wanted to ask you: what should be changed to see kafka topic messages in the pinot
table as fast as possible?  
**@nadeemsadim:** use "stream.kafka.consumer.type": "lowlevel",  
**@nadeemsadim:** and give more partitions of the kafkatopic  
**@nadeemsadim:** pinot throughput should increase considerably  
**@nadeemsadim:** if your p[roducer app is writing data on all partitions of
the kafka topic from where pinot is consuming  
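The suggested change is a one-line edit in the `streamConfigs` shown above (note, as an aside, that in some Pinot versions `"simple"` is already treated as an alias for the low-level consumer, so it is worth verifying which consumer type is actually in use):
```
"stream.kafka.consumer.type": "lowlevel"
```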
**@31cswarszawa:** How can I increase the number of partitions per topic?  
**@31cswarszawa:** Probably there’s some config I’m missing  
**@nadeemsadim:** on kafka  
**@nadeemsadim:** you can use the alter partitions command  
**@nadeemsadim:** ```bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40```  
**@31cswarszawa:** But currently I can consume messages from the kafka topic right
after they’re published (with kafkacat). I think the lag is on the Pinot table
side.  
**@nadeemsadim:** yeah  
**@31cswarszawa:** So number of partitions will improve this?  
**@31cswarszawa:** (I thought that kafka setting does not influence how pinot
consumes data)  
**@nadeemsadim:** yep hopefully  
**@nadeemsadim:** otherwise some other issue  
**@31cswarszawa:** yep, your change worked like a charm!  
**@31cswarszawa:** Many thanks:blush:  
**@nadeemsadim:** welcome  
**@ssubrama:** @31cswarszawa increasing the number of partitions in kafka
should be done carefully. Pinot has per-partition overheads, so it is not wise
to just set the number of partitions to (say) 512. Ideally, you want to start
with a small number of partitions and increase it  
**@ssubrama:** kafka does not allow decrease of partitions  
**@ssubrama:** @31cswarszawa run the realtime prov tool. Try with different
numbers of partitions, and see what the mem usage looks like. And then you can
decide on the number of partitions. Data is available instantly after it is in
kafka. If you are not seeing that, then increasing kafka partitions is not
going to do you any good.  
 **@diogo.baeder:** @diogo.baeder has joined the channel  
 **@gqian3:** Hi team, we are trying to do aggregation on a string field, e.g.
select name, max(url) from table group by name. But it results in a number
format exception. Is there any other way we can get one aggregated string value
from a group? Thanks  
**@mayanks:** How about distinct + order by?  
**@gqian3:** Yes, we tried that, but the problem is that one name can map to
multiple urls, so it would return multiple rows for each distinct name. We were
hoping to get just one url from the group by name.  
**@mayanks:** @jackie.jxt we should be able to allow applicable aggr functions
(max,min) on String columns?  
 **@nisheetkmr:** Hi, I am trying to load parquet files using a spark ingestion
task. I had built the jars for java 8 using the command ```mvn install package
-DskipTests -DskipiTs -Pbin-dist -Drat.ignoreErrors=true -Djdk.version=8
-Dspark.version=2.4.5``` While running the task I am getting the error
```21/08/20 15:11:24 ERROR LaunchDataIngestionJobCommand: Exception caught:
Can't construct a java object for
tag:,2002:org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec;
exception=Class not found:
org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec in
'string', line 1, column 1: executionFrameworkSpec: ^ at
org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:349)
at
org.yaml.snakeyaml.constructor.BaseConstructor.constructObject(BaseConstructor.java:182)
at
org.yaml.snakeyaml.constructor.BaseConstructor.constructDocument(BaseConstructor.java:141)
at
org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450) at
org.yaml.snakeyaml.Yaml.loadAs(Yaml.java:427) at
org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.getSegmentGenerationJobSpec(IngestionJobLauncher.java:94)```
I have loaded all the jars in spark class path. Any idea how to resolve this??  
**@nisheetkmr:** Ingestion spec:
```
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: 's3://<bucket>/<staging_path>/'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://<bucket>/<parquet_file_path>/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://<bucket>/<path>/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: '<region>'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: '<pinot_table_name>'
pinotClusterSpecs:
  - controllerURI: '<controller_uri>'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```  
**@nisheetkmr:** I have checked the spark cluster logs and I can see that the jars
are present, and I tried running the command
`Class.forName("org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec")`
in a notebook and it ran successfully. Cluster spark conf:
```
spark.driver.extraClassPath=/home/yarn/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.8.0-shaded.jar:/home/yarn/pinot/lib/pinot-all-0.8.0-jar-with-dependencies.jar:/home/yarn/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.8.0-shaded.jar:/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.8.0-shaded.jar:/home/yarn/pinot/pinot-spi-0.8.0-SNAPSHOT.jar
spark.driver.extraJavaOptions=-Dplugins.dir=/home/yarn/pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet
```
In the logs I can see the jars loaded:
```
21/08/20 15:10:25 INFO DriverCorral: Successfully attached library s3a://<jar_path>/pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar to Spark
.......
.......
21/08/20 15:11:21 WARN SparkContext: The jar /local_disk0/tmp/addedFile4953461200729207388pinot_all_0_8_0_SNAPSHOT_jar_with_dependencies-2dd68.jar has been added already. Overwriting of added jars is not supported in the current version.
```
And the plugins are also successfully loaded:
```
21/08/20 15:11:23 INFO PluginManager: Plugins root dir is [/home/yarn/pinot/plugins]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugins: [[pinot-s3, pinot-parquet]]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugin [pinot-parquet] from location [/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet]
21/08/20 15:11:23 INFO PluginManager: Successfully loaded plugin [pinot-parquet] from jar files: [file:/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.8.0-SNAPSHOT-shaded.jar]
21/08/20 15:11:23 INFO PluginManager: Successfully Loaded plugin [pinot-parquet] from dir [/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugin [pinot-s3] from location [/home/yarn/pinot/plugins/pinot-file-system/pinot-s3]
21/08/20 15:11:23 INFO PluginManager: Successfully loaded plugin [pinot-s3] from jar files: [file:/home/yarn/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.8.0-SNAPSHOT-shaded.jar]
21/08/20 15:11:23 INFO PluginManager: Successfully Loaded plugin [pinot-s3] from dir [/home/yarn/pinot/plugins/pinot-file-system/pinot-s3]
```  
**@mayanks:** Seems like the yaml file has issues, can you double check on
that?  
**@nisheetkmr:** According to the stack trace, the cause is class not found:
```
Caused by: org.yaml.snakeyaml.error.YAMLException: Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
 at org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:660)
 at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:335)
```
The line of code where it breaks also suggests the same:
```
try {
  cl = getClassForName(name);
} catch (ClassNotFoundException e) {
  throw new YAMLException("Class not found: " + name);
}

protected Class<?> getClassForName(String name) throws ClassNotFoundException {
  return Class.forName(name);
}
```
This doesn’t look like a yaml file issue. I had cross-verified the yaml file and
it looks fine. I have attached the yaml file for cross verification  
**@nisheetkmr:** Complete stack trace: ```21/08/20 15:11:24 ERROR
LaunchDataIngestionJobCommand: Got exception to generate IngestionJobSpec for
data ingestion job - Can't construct a java object for
tag:,2002:org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec;
exception=Class not found:
org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec in
'string', line 1, column 1: executionFrameworkSpec: ^ at
org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:349)
at
org.yaml.snakeyaml.constructor.BaseConstructor.constructObject(BaseConstructor.java:182)
at
org.yaml.snakeyaml.constructor.BaseConstructor.constructDocument(BaseConstructor.java:141)
at
org.yaml.snakeyaml.constructor.BaseConstructor.getSingleData(BaseConstructor.java:127)
at org.yaml.snakeyaml.Yaml.loadFromReader(Yaml.java:450) at
org.yaml.snakeyaml.Yaml.loadAs(Yaml.java:427) at
org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.getSegmentGenerationJobSpec(IngestionJobLauncher.java:94)
at
org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:125)
at
org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.main(LaunchDataIngestionJobCommand.java:74)
at
lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command
--1:1) at
lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw$$iw$$iw$$iw.<init>(command
--1:44) at
lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw$$iw$$iw.<init>(command--
1:46) at
lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw$$iw.<init>(command--1:48)
at lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw$$iw.<init>(command--1:50)
at lined4163e7e07a34bf996c1eea078dbedab25.$read$$iw.<init>(command--1:52) at
lined4163e7e07a34bf996c1eea078dbedab25.$read.<init>(command--1:54) at
lined4163e7e07a34bf996c1eea078dbedab25.$read$.<init>(command--1:58) at
lined4163e7e07a34bf996c1eea078dbedab25.$read$.<clinit>(command--1) at
lined4163e7e07a34bf996c1eea078dbedab25.$eval$.$print$lzycompute(<notebook>:7)
at lined4163e7e07a34bf996c1eea078dbedab25.$eval$.$print(<notebook>:6) at
lined4163e7e07a34bf996c1eea078dbedab25.$eval.$print(<notebook>) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793) at
scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054) at
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
at
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
at
scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at
scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at
scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576) at
scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572) at
com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
at
com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:202)
at
com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
at
com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
at
com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:714)
at
com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:667)
at
com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:202)
at
com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:396)
at
com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:373)
at
com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at
com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
at
com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
at
com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:275)
at
com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:49)
at
com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:373)
at
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at scala.util.Try$.apply(Try.scala:192) at
com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
at
com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
at
com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at
com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at
com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at
com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748) Caused by:
org.yaml.snakeyaml.error.YAMLException: Class not found:
org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec at
org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:660)
at
org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:335)
at
org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.construct(Constructor.java:345)
... 59 more```  
**@npawar:** Are there extra spaces before "executionFrameworkSpec" ? Looks
like it from the message you pasted. Yaml is very particular about spaces  
**@nisheetkmr:** No extra spaces before `executionFrameworkSpec`. I have
attached the yaml file in the above message as well for cross reference  
**@bruce.ritchie:** Try using --jars instead of extraClasspath  
**@mayanks:** @kulbir.nijjer I recall you recently faced the issue, spark
wasn’t seeing the class. Can we add FAQ around it?  
**@kulbir.nijjer:** @nisheetkmr please try with the --jars option first, but if
that doesn't help then in your ingestion spec YAML add a `dependencyJarDir`
entry in extraConfigs, where you will need to first copy all the jars present
in the <PINOT_HOME>/plugins folder to S3:
```
extraConfigs:
  # stagingDir is used in distributed filesystem to host all the segments, then move this directory entirely to the output directory.
  stagingDir: 's3://<somepath>/'
  dependencyJarDir: 's3://<somePath>/plugins'
```
That should address the CNFE. This is a known issue and we are debugging why the
jars are not getting deployed to YARN nodes, so meanwhile the above could be a
short-term workaround.  
**@mayanks:** Thanks @kulbir.nijjer  
 **@anu110195:** Hey, getting this sometimes while making a connection from
local  
 **@anu110195:** requests.exceptions.ConnectionError:
HTTPConnectionPool(host='pinot2-controller-external.data2.svc', port=9000):
Max retries exceeded with url: /tables/requests2/schema (Caused by
NewConnectionError('<urllib3.connection.HTTPConnection object at 0x105fc4280>:
Failed to establish a new connection: [Errno 8] nodename nor servname
provided, or not known'))  
 **@anu110195:** sometimes it works, but most of the time it doesn't and throws
the above error  

###  _#getting-started_

  
 **@kangren.chia:** am I missing something, but if you have ```# table1  #
table2 ``` do you need to tell the controller where the segments for each table
go? I only see `controller.data.dir`  
**@kangren.chia:** i tried keeping segments for different tables in the same
root dir with different prefixes and it worked  
**@kangren.chia:** although this wasn't obvious when I tried to look in the
docs, maybe I missed it somewhere?  
**@g.kishore:** Are you storing the segments on the controller?  
 **@kangren.chia:** no, I'm using deep store  
**@kangren.chia:** i am trying to tweak my settings for fast queries now  
**@kangren.chia:** i haven’t really seen a guide or resource for this, so
trying different segment size / server settings / index types  
**@g.kishore:** Agree, we can definitely improve the docs  
**@kangren.chia:** ah, I'm not complaining here, just explaining the context of why
I have different segments for different tables (benchmarking)  
**@mayanks:** I think the current usage model is that you may have controller
data dir as the parent of the two tables in s3. cc @kulbir.nijjer @xiangfu0  
**@tiger:** When using S3 as the deepstore with the
SegmentCreationAndMetadataPush, should the `controller.data.dir` (from the
controller conf) be the same as the `outputDirURI` (from the ingestion
jobspec) ?  
**@kangren.chia:** i’ve tried this and you can have: ```outputDirUri for
ingestion job 1:  outputDirUri for ingestion job 2:  controller.data.dir ```
i.e. `controller.data.dir` works as the root dir.  
 **@kulbir.nijjer:** @kulbir.nijjer has joined the channel  