Apache Pinot Daily Email Digest (2021-11-03)

### _#general_

 **@nair.a:** Hi team, we are doing a Pinot POC for offline ingestion and are
currently facing an issue while ingesting a segment from S3 into Pinot.
```
2021/11/03 08:29:22.109 INFO [SegmentFetcherFactory] [HelixTaskExecutor-message_handle_thread] Segment fetcher is not configured for protocol: s3, using default
2021/11/03 08:29:22.109 WARN [PinotFSSegmentFetcher] [HelixTaskExecutor-message_handle_thread] Caught exception while fetching segment from:  to:
java.lang.IllegalStateException: PinotFS for scheme: s3 has not been initialized
  at dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
  at dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
```
Following are our confs. Server conf: `pinot.server.instance.enable.split.commit=true`
Controller conf: `controller.local.temp.dir=/tmp/pinot/`  
 **@karinwolok1:** :speaker: This conference is still accepting speaker
submissions!!! Should be a good one. If you have a good story about your
Apache Pinot use case, please submit here :speaker:  
**@greyson:** Coming from a relational database perspective, I've had some
difficulty conceptualizing what my data might look like in Pinot. Is the
standard to have multiple tables like in an RDBMS and query them relationally
using something like Presto, or should I strive to have fewer tables with more
columns that remove the need for relational querying? If the latter is
preferable, is that still the case when the table would have to contain many
columns to replace the relational structure, and many of those columns would
need to contain things like arrays or JSON objects?  
**@bobby.richard:** Definitely the latter  
**@bobby.richard:** I am new to Pinot as well, but from what I understand
wide, denormalized tables are the norm  
**@tyler773:** Thanks @bobby.richard! So it's preferable, within reason, to
have duplicated data, to some extent, since the size of the tables in terms of
rows is less of an issue in terms of query speed than it would be in an  
**@mayanks:** Thanks @bobby.richard, yes, that is the more common usage. Having
said that, Pinot does support lookup joins (on dimension tables). And folks
have also used the Presto/Trino connectors for Pinot to do more complex queries
(joins, nested queries, etc.).  
**@mayanks:** @tyler773 yes, that is correct. Pinot is built for performance
and can scale very well with the size of the data (number of rows or otherwise).  
**@greyson:** And it's still best practice even when those columns become more
complicated? Like would it be a problem to have an array-column with 100
entries in it? What about 1000? 10,000? Is it still preferable to have that
data be stored in a column at that point instead of in its own table and
relationally joined? Or, and I assume this is not the right answer, but is a
middle-ground solution to just duplicate data across rows to avoid large array
column values?  
**@ken:** Hi @greyson - due to how Pinot can use a dictionary to compress
columnar data, “duplicate data across rows” typically doesn’t add a lot to the
size of the table, or at least that’s been our experience with having
denormalized tables.  
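
To make @ken's point about dictionary compression concrete, here is a minimal Python sketch of dictionary encoding for a single column. It is only an illustration of the general technique, not Pinot's actual implementation; the names `dictionary_encode` and `decode` and the sample `country` column are hypothetical.

```python
def dictionary_encode(values):
    """Return (sorted list of distinct values, list of integer ids per row)."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    ids = [index[value] for value in values]
    return dictionary, ids


def decode(dictionary, ids):
    """Rebuild the original column from the dictionary and the ids."""
    return [dictionary[i] for i in ids]


# Hypothetical denormalized column: a few strings repeated across many rows.
country = ["United States", "Germany", "United States", "India"] * 250

dictionary, ids = dictionary_encode(country)

# Characters stored if every row kept its own copy of the string,
# versus characters stored once in the dictionary.
raw_chars = sum(len(v) for v in country)
dict_chars = sum(len(v) for v in dictionary)

print(len(country), "rows,", len(dictionary), "distinct values")
print("raw characters:", raw_chars, "dictionary characters:", dict_chars)
```

Each distinct string is stored once, and every row holds only a small integer id, so repeating a value across rows adds little to the column's size.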
**@greyson:** So then, @ken, would it be a good idea to have multiple "rows"
with duplicated data and a single value column instead of one row with an
array column?  
**@ken:** If nothing else is changing but the value in that one column, then
we use an MV (multi-value) column and have a single row.  
**@ken:** E.g. we have a column with the unique terms, derived from another
column containing a blob of text. That’s stored as an MV column, and we can
easily query against those terms to filter to a subset of rows.  
**@greyson:** Our pipeline at present is that we have a single immutable data
type represented in a base table, and then through multiple steps in our
processing pipeline we add data to various tables that relate to the base/core
table. When you say "If nothing else is changing but the value in that one
column" are you implying that the rest of the columns should be largely
immutable as well?  
**@ken:** If you have, say, two MV columns A & B, and you’ve collapsed multiple
row values into those two columns, then you’ve lost the ability to filter to
rows where column A = x and column B = y, since those values could have come
from two different pre-collapsed rows. But it sounds like your use case is
different, in that you’re adding additional attributes to a base row, thus
there’s no row collapsing going on.  
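
The loss of row-level correlation that @ken describes can be sketched in a few lines of Python. This is a toy model with plain lists standing in for multi-value columns, not Pinot query semantics as implemented; the sample rows and the `collapse` helper are hypothetical.

```python
# Hypothetical pre-collapse rows: each row pairs one value of A with one value of B.
rows = [
    {"id": 1, "A": "x", "B": "p"},
    {"id": 1, "A": "q", "B": "y"},
]


def collapse(rows):
    """Collapse rows sharing an id into one row with multi-value columns A and B."""
    collapsed = {}
    for row in rows:
        entry = collapsed.setdefault(row["id"], {"id": row["id"], "A": [], "B": []})
        entry["A"].append(row["A"])
        entry["B"].append(row["B"])
    return list(collapsed.values())


# On the original rows, A = x AND B = y must hold within the same row.
flat_matches = [r for r in rows if r["A"] == "x" and r["B"] == "y"]

# On the collapsed row, each MV column is checked independently, so the
# row matches even though x and y came from different original rows.
mv_matches = [r for r in collapse(rows) if "x" in r["A"] and "y" in r["B"]]

print("matches before collapsing:", len(flat_matches))
print("matches after collapsing:", len(mv_matches))
```

No original row has both A = x and B = y, yet the collapsed row satisfies the filter. Adding independent attributes to a base row, as in @greyson's pipeline, does not collapse rows and so avoids this.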
**@greyson:** Awesome, thanks for your input :slightly_smiling_face:  
**@g.kishore:** this is such an amazing thread. Thanks Ken!  
 **@diogo.baeder:** Just a random comment/praise: the Pinot open source
community support is amazing! Thanks for that, guys! I'm looking forward to
my next steps in using it in production :heart:  
**@mayanks:** Thanks so much for the kind words @diogo.baeder, would love to
see you take your use case to production using Apache Pinot.  
**@diogo.baeder:** I'll make sure we have some sort of blog post or video or
similar, on the matter. :slightly_smiling_face:  
**@mayanks:** That would be amazing :pray:  
 **@ashish:** Pinot does not support the “NOT” operator and there is no
regexp_not_like. So is there any way to do the equivalent of “NOT
regexp_like(…,…)” at all in Pinot?  
**@jackie.jxt:** Not currently. We should add `NOT` operator support to Pinot.
Could you please file an issue about this?  
 **@gqian3:** Hi team, is there a Pinot query to find the last ingest time of
an offline table?  
**@mayanks:** Do you mean the time when the segment was pushed, or the max
value of the time column?  
**@mayanks:** If the latter, you can just run the SQL `select max(timeCol)`.  
**@gqian3:** I mean when the segments were pushed.  

###  _#troubleshooting_

**@adireddijagadesh:** @nair.a you could use this link and check whether you
configured the ingestion job correctly:  If it’s still occurring, can you please
share the `ingestionJobSpec.yaml`?  
**@kchavda:** Not sure if it matters, but I see the following missing from the
controller conf (I am running Docker containers):
```
pinot.role=controller
controller.zk.str=pinot-zookeeper:2181
controller.port=9000
```
**@nair.a:** Hey @adireddijagadesh, sharing the jobspec:
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName:
jobType: SegmentCreationAndMetadataPush
inputDirURI: ''
outputDirURI: ''
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'my_table'
pinotClusterSpecs:
  - controllerURI: ''
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 1
```
**@kchavda:** You're able to hit S3 from that box? Using env to pass in access
key and secret?  
**@nair.a:** Yes, we are able to access S3 from the server.  
**@kchavda:** I'm comparing what you've shared with working versions of my
jobspec and conf files to read CSV files from S3. I noticed the jobspec is
missing schemaURI and tableConfigURI under tableSpec. And the server conf is
missing:
```
pinot.server.netty.port=8098
pinot.server.adminapi.port=8097
pinot.server.instance.segmentTarDir=/tmp/pinot-tmp/server/segmentTars
```
Not sure if these things are directly causing the errors, but you can update
and give it a shot.  
**@nair.a:** Hey @kchavda, we have the above configs in server and controller.
The ingestion completes successfully, but the status of the ingested segment
shows as BAD, and upon checking the server logs, we found this error.
Will try to provide the additional configs you mentioned.  
**@nair.a:** Hey @kchavda, still the same error. Can I know how you are setting
the AWS key and secret in the server conf?  
**@adireddijagadesh:** @nair.a You could set it in the controller config as  
**@adireddijagadesh:** Refer to this link for more info and different ways of  
**@kchavda:** I followed the tutorial and found it to be very helpful. I also
passed the AWS key and secret when starting the containers (controller,
broker, server, ingestion job):
```
docker create -ti \
  --name pinot-server \
  --network=pinot-demo \
  --env AWS_ACCESS_KEY_ID= \
  --env AWS_SECRET_ACCESS_KEY= \
  -e JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins -Xms32G -Xmx32G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log" \
  --mount type=bind,source=/opt/pinot,target=/tmp \
  apachepinot/pinot:0.8.0 StartServer \
  -zkAddress pinot-zookeeper:2181
```
**@nair.a:** Okay, will try this. Currently we did set creds inside server
conf, but not in controller conf.  
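
The two log lines @nair.a posted in #general ("Segment fetcher is not configured for protocol: s3" and "PinotFS for scheme: s3 has not been initialized") point at settings that could be missing from a conf file. Below is a rough Python sketch that scans a properties file for S3-related keys. The key names (`storage.factory.class.s3`, `segment.fetcher.protocols`, `segment.fetcher.s3.class` under a `pinot.<role>.` prefix) are an assumption based on Pinot's S3 deep-store documentation and should be verified against the docs for your version. The helper names are hypothetical, and the thread does not confirm that these keys were the cause.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def missing_s3_settings(props, role):
    """Return assumed S3-related keys that are absent for 'server' or 'controller'."""
    prefix = f"pinot.{role}."
    problems = []
    # Assumed key that registers a PinotFS implementation for the s3 scheme.
    if prefix + "storage.factory.class.s3" not in props:
        problems.append(prefix + "storage.factory.class.s3")
    # Assumed key listing the protocols that have a segment fetcher.
    protocols = props.get(prefix + "segment.fetcher.protocols", "")
    if "s3" not in [p.strip() for p in protocols.split(",")]:
        problems.append(prefix + "segment.fetcher.protocols")
    # Assumed key naming the fetcher class for the s3 protocol.
    if prefix + "segment.fetcher.s3.class" not in props:
        problems.append(prefix + "segment.fetcher.s3.class")
    return problems


# The only server setting quoted in the thread.
server_conf = "pinot.server.instance.enable.split.commit=true\n"
missing = missing_s3_settings(parse_properties(server_conf), "server")
for key in missing:
    print("not set:", key)
```

Running the same check with `role="controller"` against the controller conf covers the other side of the setup.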
**@luisfernandez:** I have been asking around but is there any desire to make
pinot pagination work with group by? my current use case kinda would need
**@g.kishore:** yes.  
**@g.kishore:** this feature request is quite hot.. we will do it!  
**@luisfernandez:** Oh, this is great. Are there any plans in place, like
timelines or whatnot? Just want to get a sense.  
**@g.kishore:** Plan is to get it done by Jan..  
**@g.kishore:** Contributions welcome..  

###  _#pinot-dev_

 **@atri.sharma:** What's the process to update docs for new features?  
 **@g.kishore:** GitBook  
 **@atri.sharma:** Please point me to the link and I will get it done right
 **@walterddr:** Tip of master seems broken; looking into it unless someone
else is already on it.  

###  _#getting-started_

 **@tyler773:** Been trying to just start Pinot locally in a Docker container.
I'm using Pinot version `0.8.0` and `openjdk:11`. I'm on a Mac. I'm trying to
start the cluster by using the Pinot admin commands `StartZookeeper`,
`StartController`, `StartBroker`, and `StartServer` as shown in the getting
started guide. However, the controller inevitably goes down before I can start
the Broker and the Server, with this error: `Expiring session 0x100080c84b20005,
timeout of 30000ms exceeded`. Is there a way to avoid this?  
**@g.kishore:** Please check the JVM memory params.  
**@tyler773:** @g.kishore will do, thank you!  
