Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/09/21 02:00:16 UTC

Apache Pinot Daily Email Digest (2021-09-20)

### _#general_

  
 **@dadelcas:** Hello there, the docs say that a shared volume is required for controllers if more than one is to be deployed. Can someone shed some light on why this is needed instead of each controller having its own storage? Will all the controllers be active? Will they all write to the volume simultaneously? Are there any considerations we should take into account in a multi-controller environment? My deployment is on k8s.  
**@mayanks:** The shared volume is used for storing the golden copy of the data as it is pushed. Typically you want to configure a deep store (e.g. S3) for this purpose.  
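
For reference, a minimal sketch of the controller-side settings for an S3 deep store along these lines; the bucket, path and region below are placeholders rather than values from this thread:

```properties
# Sketch only: point the controller's data dir at an S3 deep store.
# Bucket, path and region are placeholders.
controller.data.dir=s3://my-pinot-bucket/pinot/segments
controller.local.temp.dir=/tmp/pinot-controller-tmp
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```

With a deep store configured this way, the controllers no longer need a shared persistent volume for the segment copies.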
**@dadelcas:** Cool, so if I understand what you're saying, all the controllers will in fact be active. Is that correct?  
**@mayanks:** Yes, all controllers will be active to provide fault tolerance, so if one goes down you will not have unavailability.  
**@dadelcas:** Thanks for confirming :+1:  
 **@zineb.raiiss:** Hello friends, can you help me please? If I turn off the PC, everything I did is gone. So when I connect to my machine, how can I run ThirdEye? I already installed it and got to the first page, but now I want to re-run it and I don't know how.  
 **@salkadam:** @salkadam has joined the channel  

###  _#random_

  
 **@salkadam:** @salkadam has joined the channel  

###  _#troubleshooting_

  
 **@bajpai.arpita746462:** Hi everyone, I am trying to run the Spark ingestion job with Apache Pinot 0.8.0 in my own cluster setup. I am able to run the standalone job, but when I try to run the Spark ingestion job it gives me the following error:

java.lang.RuntimeException: Failed to create IngestionJobRunner instance for class - org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner

I am using the below command to run the Spark job:

${SPARK_HOME}/bin/spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master "local[2]" --deploy-mode client \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=${PINOT_DISTRIBUTION_DIR}/plugins -Dlog4j2.configurationFile=${PINOT_DISTRIBUTION_DIR}/conf/pinot-ingestion-job-log4j2.xml" \
  --conf "spark.driver.extraClassPath=${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-SNAPSHOT-jar-with-dependencies.jar" \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-${PINOT_VERSION}-SNAPSHOT-jar-with-dependencies.jar \
  -jobSpecFile ${PINOT_DISTRIBUTION_DIR}/examples/batch/transcript/transcript_local_jobspec.yaml

I am also attaching a screenshot of the error and my job spec file for better understanding. Could anyone please help with the same?  
**@mayanks:** What version of Spark? If 3.x, maybe try 2.x.  
**@ken:** If you’re running Pinot 0.8, then I think this is a known regression (from 0.7.1). See , which has some details of how @kulbir.nijjer worked around this using Spark’s support for `dependencyJarDir`. There’s a PR () to fix this, which works for Hadoop; I haven’t tried it with Spark.  
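
For reference, a rough sketch of where `dependencyJarDir` sits in the Spark ingestion job spec; both paths below are placeholders, and the issue/PR referenced above remain the authoritative description of the workaround:

```yaml
# Sketch only: executionFrameworkSpec fragment of the ingestion job spec.
# dependencyJarDir points at a directory holding the Pinot plugin jars so
# the Spark job can load them; stagingDir and dependencyJarDir are placeholders.
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  extraConfigs:
    stagingDir: 'hdfs:///tmp/pinot-staging'
    dependencyJarDir: 'hdfs:///pinot/plugin-jars'
```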
 **@zineb.raiiss:** Hello friends, can you help me please? If I turn off the PC, everything I did is gone. So when I connect to my machine, how can I run ThirdEye? I already installed it and got to the first page, but now I want to re-run it and I don't know how.  
**@mayanks:** Hi, there is a separate Slack workspace for TE which might get you a faster response, cc @pyne.suvodeep  
**@pyne.suvodeep:** @zineb.raiiss I think you are using the default H2 DB, which is in-memory. My suggestion would be to install MySQL 5.7 and use ThirdEye with it. This doc is a bit out of date but should still help  
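
A rough sketch of the kind of change this points at, assuming ThirdEye's `persistence.yml` is the place where the database connection is configured; the database name, user and password are placeholders:

```yaml
# Sketch only: persistence.yml pointing ThirdEye at MySQL 5.7 instead of the
# default in-memory H2 database. All connection values are placeholders.
databaseConfiguration:
  url: jdbc:mysql://localhost:3306/thirdeye?autoReconnect=true
  user: thirdeye
  password: changeme
  driver: com.mysql.jdbc.Driver
```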
**@zineb.raiiss:** @pyne.suvodeep Can you add me to the workspace for TE? This is my e-mail:  
**@npawar:** Sent you an invite for TE  
**@zineb.raiiss:** Ooooh, I received it, thank you so much Neha :smiley:  
 **@salkadam:** @salkadam has joined the channel  
 **@luisfernandez:** If you use the star-tree index and say you have a time column, and you want to do different aggregations based on time, does that mean the time column also has to be part of the star-tree index? Say I have user_id, click_count, serve_time; then I should use user_id, serve_time as dimensions and SUM(click_count) as the aggregation?  
**@mayanks:** In the default config it is already added:  
**@g.kishore:** Short answer - yes, time should be part of the index  
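
For context, a minimal sketch of how that could look in the table config's star-tree section, using the column names from the question above (the rest of the table config is omitted, and values such as `maxLeafRecords` are placeholders):

```json
"tableIndexConfig": {
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["user_id", "serve_time"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__click_count"],
      "maxLeafRecords": 10000
    }
  ]
}
```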

### _#pinot-dev_

  
 **@yuchaoran2011:** @yuchaoran2011 has joined the channel  

###  _#getting-started_

  
 **@salkadam:** @salkadam has joined the channel  