Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/09/29 02:00:20 UTC

Apache Pinot Daily Email Digest (2021-09-28)

### _#general_

**@gauravjindal25:** I have a question on Pinot. In our company we are
capturing user behavioral data as events in our application. Since it’s
clickstream data, we are looking for a platform that can enable ad hoc queries
on this data with low latency. Is Pinot a recommended option? Has anyone tried
it? Product analytics tools such as Amplitude, Mixpanel etc. also enable real
time analytics on clickstream data, so I am wondering if Pinot has a similar or
better technology for ad hoc analysis on event data.
**@mayanks:** When you say ad hoc, do you mean more dynamic slice and dice? If
so, yes, Pinot is a good option
**@gauravjindal25:** Yes. Has anyone tried it and is willing to show a demo?
**@dadelcas:** Hello, it looks like minion doesn't seem to expose any pinot
metric at the moment - release 0.8.0. I'm interested in monitoring task
failures, is there an easy way to do this other than monitoring logs?
**@mayanks:** We have added some debug apis in 0.8. @ramabaratam are the
minion debug apis 0.8?
**@ramabaratam:** Yes.
```
/tasks/{taskType}/taskcounts   Fetch count of sub-tasks for each of the tasks for the given task type
/tasks/{taskType}/debug        Fetch information for all the tasks for the given task type
/tasks/task/{taskName}/debug   Fetch information for the given task name
```
**@dadelcas:** The problem is I need to integrate with existing monitoring
tools, not sure if this API will help me anyhow. Cheers
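One way the debug endpoints listed above might be wired into an external monitoring tool is a small poller, sketched below. The controller address, the task type, and especially the assumed response shape (a map from task name to a dictionary with an `error` count) are assumptions for illustration and should be checked against the actual controller output.

```python
import json
import urllib.request
from urllib.parse import quote


def task_debug_urls(controller, task_type):
    """Build the debug URLs quoted in the thread for one task type."""
    base = controller.rstrip("/")
    encoded = quote(task_type, safe="")
    return {
        "taskcounts": f"{base}/tasks/{encoded}/taskcounts",
        "debug": f"{base}/tasks/{encoded}/debug",
    }


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (requires a reachable controller)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def tasks_with_errors(taskcounts):
    """Return task names whose sub-task error count is non-zero.

    ASSUMPTION: taskcounts looks like {taskName: {"error": <int>, ...}}.
    """
    return sorted(
        name
        for name, counts in taskcounts.items()
        if isinstance(counts, dict) and counts.get("error", 0) > 0
    )
```

A monitoring job could call `fetch_json(task_debug_urls(...)["taskcounts"])` on a schedule and alert when `tasks_with_errors` returns a non-empty list.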
**@ramabaratam:** I see these MinionMeter metrics, which will translate to
Prometheus pinot_minion_numberTasksExecuted etc. with label "id" as taskType.
```
NUMBER_TASKS_EXECUTED("tasks", false),
NUMBER_TASKS_COMPLETED("tasks", false),
NUMBER_TASKS_CANCELLED("tasks", false),
NUMBER_TASKS_FAILED("tasks", false),
NUMBER_TASKS_FATAL_FAILED("tasks", false);
```
**@mayanks:**
**@ramabaratam:** You might want to look at MinionQueryPhase::
```TASK_EXECUTION``` for total execution time for tasktype. will translate to
Prometheus timer metric as pinot_minion_taskExecution...
**@mayanks:** @dadelcas ^^ seems like we might have some metrics that you can
already use currently.
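If those meters are scraped in the Prometheus text format, failed-task counts per task type could be pulled out roughly as follows. This is only a sketch: the exact exported name (here assumed to start with `pinot_minion_numberTasksFailed`, by analogy with the `pinot_minion_numberTasksExecuted` name mentioned above) and any suffix such as `_Count` depend on the JMX exporter configuration.

```python
import re

# One sample line: metric_name{label="value",...} 12.0
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def failed_tasks_by_type(text, metric_prefix="pinot_minion_numberTasksFailed"):
    """Sum failed-task samples per task type (taken from the "id" label).

    ASSUMPTION: metric_prefix matches the exported meter name; adjust it to
    whatever the exporter actually emits.
    """
    totals = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match or not match.group("name").startswith(metric_prefix):
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        task_type = labels.get("id", "unknown")
        totals[task_type] = totals.get(task_type, 0.0) + float(match.group("value"))
    return totals
```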
**@dadelcas:** I've also got a question about partitioning in hybrid tables.
If I understand correctly this only applies to offline tables. Does
`segmentPartitionConfig` play together with the time column? The field only
accepts 1 value at the moment and I was wondering whether segments are
generated using `timeColumnName` and further partitioned using the
`segmentPartitionConfig`? If I don't specify partitioning then the segments
are effectively replicated to all the servers?
**@mayanks:** This config helps partition data on a primary key. You typically
don’t want to choose time column for this, but a column that appears on most
queries with equality predicate. If you don’t use any partitioning or replica
group based assignment, then each segment can go to any of the servers (n
copies).
**@dadelcas:** Thank you Mayank. So the attribute
`segmentConfig.timeColumnName` is not used in an offline table?
**@dadelcas:** I guess if I need to partition my data by, let's say, hour,
then I need to inject a new field in my data and use it in the partition
config of the offline table
**@mayanks:** No no, I am saying the time column already gets special
treatment in terms of time based pruning. If you partition your data on the
time column, you don't need to explicitly tell Pinot. The
`segmentPartitionConfig` is more for primary key based partitioning.
**@dadelcas:** Got it, thanks for clarifying that
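For reference, a table config fragment for primary key based partitioning might look like the sketch below. The column name `memberId`, the partition count, and the choice of the `Murmur` function are placeholders; the data itself would still need to be partitioned the same way upstream for pruning to be meaningful.

```python
import json


def partitioned_index_config(column, num_partitions, function_name="Murmur"):
    """Build the partition-related pieces of a Pinot table config.

    ASSUMPTION: column, num_partitions and function_name are illustrative;
    they must match how the input data was actually partitioned.
    """
    return {
        "tableIndexConfig": {
            "segmentPartitionConfig": {
                "columnPartitionMap": {
                    column: {
                        "functionName": function_name,
                        "numPartitions": num_partitions,
                    }
                }
            }
        },
        # Lets the broker skip segments whose partition cannot match
        # an equality predicate on the partition column.
        "routing": {"segmentPrunerTypes": ["partition"]},
    }


if __name__ == "__main__":
    print(json.dumps(partitioned_index_config("memberId", 4), indent=2))
```

The time column is deliberately absent here, in line with the advice above that time based pruning does not need an explicit partition config.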
**@sirsh:** Hello - is there a way to use the controller's REST interface to
submit OFFLINE table ingestion tasks? Details are:
• Parquet files on S3
• Have created a schema and table spec already on Pinot (which is deployed on
K8s (Argo/Helm))
I have seen there is an ingestion task that can be triggered, e.g. using the
scripts/utils in the Pinot distribution, but I would like to do this directly
using REST commands. I would like to apply the `SegmentCreationAndUriPush`
strategy, and I would either set up something that runs on a daily or hourly
schedule OR just trigger one-off tasks myself. Either works.
**@g.kishore:**
**@g.kishore:**
```
curl -X POST " &batchConfigMapStr={
  "inputFormat":"json",
  "input.fs.className":"org.apache.pinot.plugin.filesystem.S3PinotFS",
  "input.fs.prop.region":"us-central",
  "input.fs.prop.accessKey":"foo",
  "input.fs.prop.secretKey":"bar"
} &sourceURIStr="
```
**@sirsh:** Thank you @g.kishore - the docs suggest this is for small files or
testing because files need to be downloaded - or am i reading this wrong?
**@mayanks:** The ingestionFromURI endpoint is for a quickstart kind of setup.
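For a quickstart-style experiment, the request URL for this kind of URI based ingestion could be assembled as in the sketch below, which takes care of URL-encoding the JSON batch config. The `/ingestFromURI` path and the `tableNameWithType` parameter are assumptions based on the Pinot documentation rather than on this thread, and the controller address, table name, and source URI are hypothetical placeholders.

```python
import json
from urllib.parse import urlencode


def ingest_from_uri_url(controller, table_with_type, source_uri, batch_config):
    """Build a URL-encoded request URL for URI based ingestion.

    ASSUMPTION: the endpoint path and the tableNameWithType parameter follow
    the Pinot docs; verify them against your controller's Swagger UI.
    """
    query = urlencode(
        {
            "tableNameWithType": table_with_type,
            "batchConfigMapStr": json.dumps(batch_config, separators=(",", ":")),
            "sourceURIStr": source_uri,
        }
    )
    return f"{controller.rstrip('/')}/ingestFromURI?{query}"


if __name__ == "__main__":
    url = ingest_from_uri_url(
        "http://localhost:9000",          # hypothetical controller
        "myTable_OFFLINE",                # hypothetical table
        "s3://my-bucket/path/data.json",  # hypothetical source file
        {
            "inputFormat": "json",
            "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
        },
    )
    print(url)  # send with an HTTP POST, e.g. curl -X POST "<url>"
```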
**@mayanks:** Do you mean you want to schedule and control the job that can
generate and push segments?
**@sirsh:** yes - that is exactly what i want to do.
**@g.kishore:** we dont have this now but we are thinking of a simple
solution.. can you please file a ticket.. I will add my thoughts to that
**@barana:** @barana has joined the channel
**@sirsh:** Hello... I have a question related to Kafka and SSL specifically.
I submitted schema and REALTIME table specs but I can see that my SSL
configuration is not correct. For a standard deployment of Pinot using Helm on
K8s, I would like to understand where the SSL cert location is expected to be,
so I can configure SSL correctly for my table. Adding a segment of my spec to
the thread.
**@sirsh:**
```
"tableIndexConfig": {
  "loadMode": "MMAP",
  "streamConfigs": {
    "streamType": "kafka",
    "security.protocol": "SSL",
    "ssl.truststore.location": "/opt/pinot/kafka.client.truststore.jks",
    "stream.kafka.topic.name": "MY-TOPIC",
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.consumer.prop.auto.offset.reset": "largest",
    "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.segment.size": "100M",
    "stream.kafka.zk.broker.url": "",
    "stream.kafka.broker.list": "",
    "schema.registry.url": "kafka-schema-registry-cp-schema-registry.kafka-schema-registry.svc.cluster.local:8081",
    "stream.kafka.decoder.prop.schema.registry.rest.url": "kafka-schema-registry-cp-schema-registry.kafka-schema-registry.svc.cluster.local:8081"
  }
}
```
**@sirsh:** @mayanks
**@mayanks:** Hi @sirsh here's some info if that helps:
**@sirsh:** Thanks @mayanks - i actually read this one but was unsure of some
actual values to use. I can keep experimenting. For example what should the
`ssl.truststore.location` be and do i need to do any setup
**@mayanks:** @slack1 ^^
**@slack1:** Hi @sirsh - afaik the Helm template wasn’t updated yet to handle
the injection of SSL certs and keystores. The fastest way to get you to a
working setup would be to create a configmap with prep’d keystore/truststore
and then hack the deployment spec to include them as local volumes on the path
you set up above. If you build out a more generic solution, we’d be very glad
to include them in the chart as a contribution. There’s just so much to do
around Pinot right now.
**@sirsh:** thats very useful to know - makes sense, thanks both!
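The "hack the deployment spec" step described above could be expressed as a Kubernetes patch, sketched here as a small generator. Every name in it (the ConfigMap `pinot-kafka-truststore`, the container name `server`, the volume name) is a hypothetical placeholder, not something taken from the Pinot Helm chart; only the mount path comes from the table spec in this thread. Note that a binary `.jks` file would have to go under a ConfigMap's `binaryData`, and a Secret may be the more usual home for it.

```python
import json


def truststore_patch(configmap, container, mount_path, key):
    """Build a strategic-merge patch that mounts one ConfigMap key as a file.

    ASSUMPTION: configmap, container and key are illustrative names; check
    the real StatefulSet for the actual container name.
    """
    volume = "kafka-truststore"  # hypothetical volume name
    return {
        "spec": {
            "template": {
                "spec": {
                    "volumes": [
                        {"name": volume, "configMap": {"name": configmap}}
                    ],
                    "containers": [
                        {
                            "name": container,
                            "volumeMounts": [
                                {
                                    "name": volume,
                                    "mountPath": mount_path,
                                    "subPath": key,  # mount a single file, not a directory
                                    "readOnly": True,
                                }
                            ],
                        }
                    ],
                }
            }
        }
    }


if __name__ == "__main__":
    patch = truststore_patch(
        "pinot-kafka-truststore",
        "server",
        "/opt/pinot/kafka.client.truststore.jks",
        "kafka.client.truststore.jks",
    )
    print(json.dumps(patch, indent=2))
```

The printed JSON could be saved to a file and applied to the Pinot server StatefulSet with `kubectl patch`, so that the file appears at the path referenced by `ssl.truststore.location`.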
**@weixiang.sun:** What does “disabling the realtime table” mean? No
Streaming Ingestion and No query served? I do not see any specific document
about it.
**@mayanks:** Disabling the table means stopping both. May I ask what
specifically are you looking for?

### _#random_

**@barana:** @barana has joined the channel

### _#troubleshooting_

**@yash.agarwal:** I am trying to set up a new Pinot cluster. I have a
ZooKeeper cluster up. When I try to bring up the first Pinot controller, it
starts and then fails with an error:
```
Pinot Controller instance [Controller_piclx1001.hq.target.com_9000] is Started...
Started Pinot [CONTROLLER] instance [Controller_piclx1001.hq.target.com_9000] at 13.884s since launch
Shutting down Pinot Service Manager with all running Pinot instances...
Trying to stop Pinot [CONTROLLER] Instance [Controller_piclx1001.hq.target.com_9000] ...
Stopping controller periodic tasks
Stopping periodic task scheduler
.
.
Instance piclx1001.hq.target.com_9000 is not leader of cluster PinotCluster due to exception happen when session check
org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1192) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-ffcf9b991431067c834bd4fb56fd7641c7fec172]
    at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1326) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-ffcf9b991431067c834bd4fb56fd7641c7fec172]
    at org.apache.helix.manager.zk.zookeeper.ZkClient.readData(ZkClient.java:1318) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-ffcf9b991431067c834bd4fb56fd7641c7fec172]
    at org.apache.helix.manager.zk.ZkBaseDataAccessor.get(ZkBaseDataAccessor.java:320) ~[pinot-all-0.9.0-SNAPSHOT-jar-with-dependencies.jar:0.9.0-SNAPSHOT-ffcf9b991431067c834bd4fb56fd7641c7fec172]
.
.
Closing zkclient: State:CONNECTED Timeout:30000 sessionid:0x1006d21ecf40000 local:/10.59.116.133:53916 remoteserver: lastZxid:17179869383 xid:1154 sent:1157 recv:1199 queuedpkts:0 pendingresp:0 queuedevents:0
Session: 0x1006d21ecf40000 closed
```
**@jackie.jxt:** Did you explicitly shut down the controller, or did it shut
down by itself? The error happens after the shutdown.
**@karinwolok1:** Hey hey! :wave: :speaker: :speaker: :speaker: We're looking
for presenters for the Apache Pinot :wine_glass: meetup!!!! :smiley: :brain:
Anyone have any topics they're interested in presenting or have ideas for
topics you'd like to see, please DM me! :email:
**@barana:** @barana has joined the channel
**@zineb.raiiss:** Hello, I want to create a new table in Pinot. My data
source has no time column, only STRING type columns. I created the schema and
the table config file, but when running the command I got this error. Do you
have an idea or a solution?
**@zineb.raiiss:** executing command: AddTable -tableConfigFile /tmp/pinot-quick-start/tools-table-offline.json -schemaFile /tmp/pinot-quick-start/tools-schema.json -controllerProtocol http -controllerHost 192.168.1.105 -controllerPort 9000 -user null -password [hidden] -exec
Sending request: to controller: YD-5CG1182FLG, version: Unknown
{"code":400,"error":null}
**@xiangfu0:** can you share the table conf and schema file?
**@zineb.raiiss:** I solved the problem: I added "replication": "1" to my
table config.
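As an illustration of that fix, a minimal OFFLINE table config without a time column might be generated like this. The table name `tools` is only a guess from the file names in the command above, and placing `replication` under `segmentsConfig` follows the usual Pinot table config layout rather than anything shown in this thread.

```python
import json


def offline_table_config(table_name, replication=1):
    """Build a minimal OFFLINE table config with an explicit replication.

    ASSUMPTION: table_name is illustrative; no timeColumnName is set because
    the data source described above has no time column.
    """
    return {
        "tableName": table_name,
        "tableType": "OFFLINE",
        "segmentsConfig": {"replication": str(replication)},
        "tenants": {},
        "tableIndexConfig": {"loadMode": "MMAP"},
        "metadata": {},
    }


if __name__ == "__main__":
    print(json.dumps(offline_table_config("tools"), indent=2))
```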
**@gqian3:** Hi team, we are seeing some Pinot queries with the avg function
return -Infinity when the where clause matches no records. Is there a way
to modify the query to return Null in this case?
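No query-side answer appears in this digest. As a client-side workaround, non-finite aggregation results could be mapped to null after the query returns, as in the sketch below. It assumes result rows arrive as lists of values in which -Infinity shows up either as a float or as the string "-Infinity"; that encoding is an assumption and depends on the client used.

```python
import math

# String spellings a JSON response might use for non-finite doubles
# (ASSUMPTION: adjust to what your client actually returns).
NON_FINITE_STRINGS = {"Infinity", "-Infinity", "NaN"}


def null_if_non_finite(value):
    """Return None for +/-Infinity or NaN, otherwise the value unchanged."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, str) and value in NON_FINITE_STRINGS:
        return None
    return value


def clean_rows(rows):
    """Apply null_if_non_finite to every cell of a list of result rows."""
    return [[null_if_non_finite(cell) for cell in row] for row in rows]
```

Selecting `COUNT(*)` alongside the average and treating the average as null when the count is zero would be another way to make the empty case explicit.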