Posted to dev@pinot.apache.org by Pinot Slack Email Digest <ap...@gmail.com> on 2021/09/28 02:00:21 UTC

Apache Pinot Daily Email Digest (2021-09-27)

### _#general_

  
 **@rajesh.narayan:** @rajesh.narayan has joined the channel  
 **@rajesh.narayan:** Hey guys - I am looking to explore Pinot for some of our use cases, and also looking for enterprise-level support. Who can help?  
**@atri.sharma:** Please ask your questions here  
**@rajesh.narayan:** is there commercial support available for Pinot?  
**@atri.sharma:**  
**@rajesh.narayan:** Thanks  
**@kennybastani:** Rajesh, let me know if you want me to connect you to
someone at StarTree.  
 **@nageshblore:** I am new to Pinot. I am trying to understand whether a Pinot query in, say, a Java client () can be made to work similarly to the KStream example (). That is, the KStream example does not "loop" to look for new messages to apply transformations, whereas it is not clear to me whether the Pinot example will keep running until stopped. My scenario is as follows: for every new message that arrives, I want a batch of records between `[current_timestamp - 1 minute, current_timestamp]`, where `current_timestamp` corresponds to the most recently arrived message. Can a Pinot query client be written to run as soon as a new message arrives? Thanks.  
**@kennybastani:** Sure. You can subscribe to the topic as a Kafka consumer in
your Java application using the Pinot Java Client. You'll need to deal with a
potential race condition of whether or not Pinot has ingested the record into
a real-time table yet. In this case, you can issue a query to check if the
record exists yet in Pinot, and if it does not, create an async thread on a
scheduled executor that retries on a periodic interval until that record is
ingested. From there you can then execute the query that aggregates over your
window.  
**@kennybastani:** Does that make sense?  
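A minimal sketch of that consume-check-retry-query flow, assuming the Pinot Java client (`pinot-java-client`) and a plain Kafka consumer; the broker and ZK addresses, topic, table, column names, and queries below are placeholders, not details from the thread:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.pinot.client.Connection;
import org.apache.pinot.client.ConnectionFactory;
import org.apache.pinot.client.ResultSetGroup;

public class WindowedQueryOnArrival {

  public static void main(String[] args) {
    // Pinot connection via the cluster's Zookeeper (ConnectionFactory.fromHostList works too).
    Connection pinot = ConnectionFactory.fromZookeeper("localhost:2181/PinotCluster");
    ScheduledExecutorService retryPool = Executors.newSingleThreadScheduledExecutor();

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "window-query-client");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("events"));
      while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
          String id = record.key();   // placeholder: however the arriving record is identified
          long ts = record.timestamp();
          // Hand off to the retry pool so the consumer thread never blocks on Pinot.
          retryPool.execute(() -> queryWhenIngested(pinot, retryPool, id, ts));
        }
      }
    }
  }

  private static void queryWhenIngested(Connection pinot, ScheduledExecutorService retryPool,
      String id, long ts) {
    // Has the record landed in the real-time table yet?
    ResultSetGroup check = pinot.execute("SELECT COUNT(*) FROM events WHERE id = '" + id + "'");
    if (check.getResultSet(0).getLong(0, 0) == 0) {
      // Not ingested yet: retry on a short interval instead of spinning.
      retryPool.schedule(() -> queryWhenIngested(pinot, retryPool, id, ts), 200, TimeUnit.MILLISECONDS);
      return;
    }
    // Record is visible: aggregate over [ts - 1 minute, ts].
    ResultSetGroup window = pinot.execute(
        "SELECT COUNT(*) FROM events WHERE eventTime BETWEEN " + (ts - 60_000L) + " AND " + ts);
    System.out.println("events in the last minute: " + window.getResultSet(0).getLong(0, 0));
  }
}
```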
 **@meetdesai74:** @meetdesai74 has joined the channel  
 **@ken:** I thought this was a good write-up of how TimescaleDB works with
approximate percentiles (focusing on time-series data):  
**@mayanks:** Thanks for sharing @ken. Just fyi, we do have TDigest based
approximations for percentile in Pinot.  
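For reference, a quick sketch of how the TDigest-based percentile aggregation is used in a Pinot query (the table and column names are made up, not from the thread):

```sql
-- Approximate p95 latency per endpoint using the TDigest-backed aggregation
SELECT endpoint, PERCENTILETDIGEST95(latencyMs)
FROM requests
GROUP BY endpoint
LIMIT 100
```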
**@ashish:** For distinct count, Pinot has an Apache DataSketches based implementation. The DataSketches library also has a quantile sketch - any interest in supporting percentile aggregation based on it?  
 **@rjpatrick:** @rjpatrick has joined the channel  

###  _#random_

  
 **@rajesh.narayan:** @rajesh.narayan has joined the channel  
 **@meetdesai74:** @meetdesai74 has joined the channel  
 **@rjpatrick:** @rjpatrick has joined the channel  

###  _#troubleshooting_

  
 **@kangren.chia:** is there an API to update the helix config and verify that
the changes have taken place? i’ve looked through the swagger API and ```POST
/cluster/configs``` seems to be the most likely candidate, but it’s not
working for me  
**@g.kishore:** which config are you trying to change  
**@kangren.chia:** `controller.dimTable.maxSize`  
**@kangren.chia:**  
**@g.kishore:** some of them are dynamic (no need of restart) and some of them
require restart  
**@kangren.chia:** can i restart via the REST api?  
**@g.kishore:** this one looks like it requires restarting controller  
**@kangren.chia:** > some of them are dynamic (no need of restart) and some of them require restart
how do i differentiate between these 2 categories?  
**@g.kishore:** unfortunately, we dont have a good way to do that.. most of
them can be made dynamic if someone asks for it and we see the need in
production  
**@g.kishore:** for e.g. this one should have been dynamic  
**@g.kishore:** controller already gets notified when config changes  
**@g.kishore:** but the code is reading it only at startup and caching it  
**@kangren.chia:** ah ok! let me try restarting it now  
**@kangren.chia:** it works, thanks!  
 **@bajpai.arpita746462:** Hi everyone, I am trying deduplication in Apache Pinot 0.8.0. I have enabled upsert in my REALTIME table and it is working as expected, but when data moves from the realtime to the offline table, duplicate data appears in the offline table. Is there a way to enable deduplication in the REALTIME-to-OFFLINE flow, so that my offline table contains only distinct values? Can anyone help me with the same?  
**@g.kishore:** Upsert is not supported with hybrid tables (realtime and offline mode). It's in the works; @yupeng can provide more info. Any reason you cannot use it in real-time-only mode?  
**@bajpai.arpita746462:** we need to backfill data that is the reason  
**@yupeng:** for backfill you can still write the data to kafka  
**@npawar:** For rows that should dedup, will the rows have identical values across all columns (all metrics, all dimensions, and the time column as well)? If yes, the realtime-to-offline job has a config for dedup  
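A hedged sketch of what that dedup config could look like, mirroring the `RealtimeToOfflineSegmentsTask` config quoted later in this digest but with the collector switched from `concat` to `dedup` (which drops rows that are identical across all columns); the time periods here are placeholders:

```json
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "1d",
      "collectorType": "dedup"
    }
  }
}
```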
 **@kangren.chia:** using a lookup table in a SQL query with group by + having clause is extremely slow, it looks like this:
```
# old (<1s)
SELECT user, count(*) FROM events
WHERE time BETWEEN 0 AND 31 AND location BETWEEN 1000 AND 1005
GROUP BY user HAVING count(*) > 10

# new (>10s)
SELECT user, count(*) FROM events
WHERE time BETWEEN 0 AND 31 AND location BETWEEN 1000 AND 1005 AND lookUp(...)=0
GROUP BY user HAVING count(*) > 10
```
**@kangren.chia:** without the HAVING clause, query time is ok  
**@kangren.chia:** which makes me think putting the fields in the lookup
dimension table in the fact table will be better performance wise  
**@kangren.chia:** even if it means much more duplicated data  
**@kangren.chia:** at least with the type of queries i’ll like to perform  
**@richard892:** denormalising the table for the sake of a common query is
usually a good idea  
**@kangren.chia:** im curious what are the use cases of the lookup table  
**@richard892:** in any case, it would be good to see a profile of the slow
query  
**@kangren.chia:** btw, the google doc links in  are locked to the public (am
assuming they were meant to be open)  
**@kangren.chia:** @richard892 profiling as mentioned here right?  
**@richard892:** > im curious what are the use cases of the lookup table
it's not always possible to denormalise, e.g. if the stream filling the lookup table can lag behind the stream populating the fact table. It may be the case that the reference data updates more frequently than the facts (e.g. fx rates vs transactions) and so on. If the lookup table is static, I would just denormalise and have fast queries.  
**@richard892:** yes, often perf issues like this can be resolved by
configuration, but getting profiles for slow queries is generally useful
because it helps discover and address "unknown unknowns" - there may be low
hanging fruit here, and the profile will probably find it if it's there.  
**@richard892:** @mayanks or @jackie.jxt might have some advice on tuning the
HAVING clause later  
**@kangren.chia:** thanks richard!  
**@jackie.jxt:** @kangren.chia Does this query have small latency? ```SELECT
user, count(*) FROM events WHERE time BETWEEN 0 AND 31 AND location BETWEEN
1000 AND 1005 AND lookUp(...)=0 GROUP BY user```  
**@jackie.jxt:** Using `lookUp` within the filter is quite expensive because Pinot won't be able to utilize an index to solve the query, but has to scan and look up each value. The latency of this query with or without HAVING should be similar  
**@jackie.jxt:** FYI, if you have more than 10 users matching the filter, you
should add an `ORDER BY count(*) DESC` to get the accurate result. Pinot won't
keep all the groups by default in order to reduce the memory usage  
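Putting that advice together with the earlier query, it would look roughly like this (`lookUp(...)` left elided as in the thread, and the LIMIT raised on the assumption that more than the default 10 groups are wanted back):

```sql
SELECT user, count(*) FROM events
WHERE time BETWEEN 0 AND 31 AND location BETWEEN 1000 AND 1005 AND lookUp(...)=0
GROUP BY user
HAVING count(*) > 10
ORDER BY count(*) DESC
LIMIT 1000
```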
 **@valentin:** Hello, I’m using RealtimeToOfflineSegmentsTask to move segments from REALTIME tables to OFFLINE ones. It works pretty well on some of my tables, but 2 of them are ignored. I can’t find anything in the logs. If I check the value of ```<cluster name>/PROPERTYSTORE/MINION_TASK_METADATA/RealtimeToOfflineSegmentsTask/<table name>``` in Zookeeper, I see that the document hasn’t been updated since August 2nd. And I can’t find any task related to this table in ```<cluster name>/CONFIGS/RESOURCE/TaskQueue_RealtimeToOfflineSegmentsTask_Task_RealtimeToOfflineSegmentsTask_1632701552537```. How is this possible? How can I reset this task? Is there somewhere I can look for info to debug this and prevent it from happening again? I’m using Pinot 0.7.1 (I plan to upgrade soon but I can’t right now)  
**@g.kishore:** can you check the min/max of the time column in each segment  
**@valentin:** in the REALTIME table?  
**@valentin:** for the completed segments in the realtime table:
```
start: 1632454589759 end: 1632541000579
start: 1632541000917 end: 1632627410395
start: 1632627410714 end: 1632713820512
```
**@valentin:** and the task config in the table is:
```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "2d",
      "bufferTimePeriod": "1d",
      "collectorType": "concat",
      "maxNumRecordsPerSegment": "390000"
    }
  }
},
```
**@npawar:** Check for logs in the controller. There should be a log about
trying to schedule the task  
 **@sirsh:** Hello... I am trying to get started with a batch/offline ingestion
```
../apache-pinot-0.8.0-bin/bin/pinot-ingestion-job.sh -jobSpecFile ./pinot_ingest_samples/batch_ingestion_no_comment.yaml
```
Using this file
```
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentCreationAndUriPush
inputDirURI: ''
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: ''
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'sample'
  schemaURI: ''
  tableConfigURI: ''
pinotClusterSpecs:
  - controllerURI: ''
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
  segmentUriPrefix: ''
  segmentUriSuffix: ''
```
But i get a class not found exception `java.lang.RuntimeException: Failed to create IngestionJobRunner instance for class - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner`  
**@sirsh:** I am wondering if there are some Java noob things I need to do, like setting certain env vars, classpaths or something. I expected to just download the Pinot binary, run the script with an input file, and for it to just work, but not so.  
**@sirsh:** ... or maybe I should use Docker locally for testing as well as when deploying agents to K8s, and it will make life easier - I can do that too?  
**@richard892:** which JDK version are you using?  
**@richard892:** there was a regression in 0.8.0 which has been fixed by @ken
which might relate to this  
**@sirsh:** My java version output is ```java version "11.0.12" 2021-07-20 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.12+8-LTS-237) Java
HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.12+8-LTS-237, mixed mode)```  
**@richard892:** thanks, I'm not sure the fix applies then.  
**@sirsh:** Ok thanks Richard. I'll try the same thing in Docker later - running locally is not something I really need to do except to get that warm fuzzy feeling when it works for the first time  
**@sirsh:** Trailing question: is there another way for me to use the REST interface on the controller to ingest data? Really all I want is some scheduled way to move data from Parquet files on S3 to table segments. I guess this is pretty basic.  
**@ken:** Hi @sirsh - I’d suggest asking that question on the
<#CDRCA57FC|general> channel.  
 **@rajesh.narayan:** @rajesh.narayan has joined the channel  
 **@meetdesai74:** @meetdesai74 has joined the channel  
 **@will.gan:** Hi, I'm trying to query json like done , i.e. without using
functions. I have a JSON column with a json index on it, and I ingest it using
a `jsonFormat` ingestion transform. When I select the column it shows the json
as a string, but when I try to select things like `json_column[0].name` it
returns empty. Anyone know the issue? Thanks in advance!  
 **@rjpatrick:** @rjpatrick has joined the channel  

###  _#thirdeye-pinot_

  
 **@vaibhav.mital:** @vaibhav.mital has joined the channel  