Posted to dev@pinot.apache.org by Pinot Slack Email Digest <sn...@apache.org> on 2021/01/23 02:00:15 UTC

Apache Pinot Daily Email Digest (2021-01-22)

### _#general_

  
 **@niels.it.berglund:** @niels.it.berglund has joined the channel  
 **@aishee:** @aishee has joined the channel  
 **@mikexia:** @mikexia has joined the channel  
 **@humengyuk18:** Hi team, when using upsert with a realtime table, can we do
segment compaction or merging for committed segments, i.e. merge multiple small
segments into one large segment? If not, how should we deal with too many
small segments when using upsert? @yupeng @jackie.jxt  
**@yupeng:** upsert in pinot does not use compaction; it uses metadata to track
the records of the same key. you can find the design details in this doc  
**@yupeng:** segment size can be controlled via threshold, which is separate  
**@humengyuk18:** I see, so in the current design there is no way to merge
multiple small segments into larger segments for an upsert table? We can only
control segment size during ingestion?  
**@yupeng:** that’s right. take a look at these configs  
**@yupeng:** ```"realtime.segment.flush.threshold.size": "0",
"realtime.segment.flush.threshold.time": "24h",
"realtime.segment.flush.desired.size": "50M",```  
**@jackie.jxt:** These are two separate topics. Segment merge is not supported
yet, and the feature is on our roadmap  
**@jackie.jxt:** Once segment merge is supported, we should be able to merge
segments for upsert table  
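For context, the flush thresholds quoted above sit in the `streamConfigs` section of the realtime table config. A minimal sketch of that section, assuming a Kafka low-level consumer (the topic name is a placeholder, not from the thread):
```
"tableIndexConfig": {
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.consumer.type": "LowLevel",
    "stream.kafka.topic.name": "<my topic>",
    "realtime.segment.flush.threshold.size": "0",
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.desired.size": "50M"
  }
}
```
With `realtime.segment.flush.threshold.size` set to `0`, the desired-size setting drives how large each completed segment grows, and the time threshold caps how long a consuming segment stays open before it is committed.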
 **@wrbriggs:** Do star-tree index aggregations span segments, or are the
aggregates computed per segment?  
**@g.kishore:** per segment  
**@g.kishore:** have been thinking of extending it to span multiple segments -
do you have a concrete case?  
**@wrbriggs:** Awesome, thank you. So theoretically, I should be able to make
use of star-tree aggregates without a time column, and Pinot will (or could)
merge the aggregates on relevant segments after partition pruning?  
**@g.kishore:** yes  
**@wrbriggs:** My use case actually is better if they are per-segment  
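For readers following the star-tree thread, a minimal sketch of how a star-tree index is declared in the table config (the dimension and metric names here are placeholders, not the asker's schema):
```
"tableIndexConfig": {
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["dimension", "otherDimension"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__metric"],
      "maxLeafRecords": 10000
    }
  ]
}
```
Each segment builds its own star-tree from this definition, which is why the pre-aggregations are per segment and are combined at query time across whichever segments survive pruning.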
 **@vadlamani1729:** @vadlamani1729 has joined the channel  

###  _#random_

  
 **@niels.it.berglund:** @niels.it.berglund has joined the channel  
 **@aishee:** @aishee has joined the channel  
 **@mikexia:** @mikexia has joined the channel  
 **@vadlamani1729:** @vadlamani1729 has joined the channel  

###  _#troubleshooting_

  
 **@niels.it.berglund:** @niels.it.berglund has joined the channel  
 **@aishee:** @aishee has joined the channel  
 **@contact:** Hey, quick question: we have a realtime segment marked as
completed and we would like to move it to an offline table, however the endpoint
to download the segment (`GET /segments/{tableName}/{segmentName}`) is trying
to fetch it from the deep store. I was thinking of downloading it and
uploading it to the offline table directly, how could I achieve this? Thanks  
**@wrbriggs:**  
**@wrbriggs:**  
**@wrbriggs:** @contact This should be possible using a minion based on the
second link  
**@contact:** Yeah I saw that, but I don't really want the minion to rebuild
the segments, I just want to move them as-is  
**@contact:** Is there a way to do this @wrbriggs?  
**@contact:** I'm actually deep into how the task works and I'm seeing
`realtimeSegmentZKMetadata.getDownloadUrl()`, which I haven't followed yet  
**@contact:** but I guess I could find my answer there?  
**@wrbriggs:** I’m not sure - the tricky part is that the offline table might
not have the same partitioning / sorting / indexing as the realtime table, and
the *`RealtimeToOfflineSegmentsTask`* handles that generically for you - by
simply moving the segments as-is, you are kind of shoehorning yourself into
never diverging the offline table.  
**@contact:** Well, I configured the realtime table exactly the same as the
offline one, so I should be fine, right?  
**@wrbriggs:** For the short-term, yes  
**@contact:** What would be the problem in the long term? I mean, if I have an
issue I can just re-index the segment and re-upload it?  
**@wrbriggs:** It’s not an approach I would put into production, though  
**@contact:** From my understanding, the only difference between realtime and
offline when a segment is completed is that the realtime table stores it
locally while the offline table stores it in the deep store  
**@contact:** I guess I'm missing something?  
**@wrbriggs:** Realtime tables also push segments to the deep store  
**@contact:** Hmmm, that's surely something I missed  
**@contact:** Is it automatic when the segment is completed?  
**@wrbriggs:** :  
**@contact:** Thanks, something that I haven't mentioned is that my streams
are high-level. I'm seeing in the task code that it only works with low-level  
**@wrbriggs:** See this as well:  
**@wrbriggs:** Ah, I have no experience using the high level stream consumer,
unfortunately. I went straight to low-level based on the limitations of the
high level streams.  
**@contact:** Thanks anyway, you were very helpful  
**@npawar:** We don't recommend or maintain high level any more. Any
particular reason you are using high level?  
**@npawar:** Simply moving segments to offline table by downloading and
reuploading can give you incorrect results in the time boundary calculation at
the brokers  
**@contact:** @npawar We are currently using GCP's Pub/Sub system for Pinot and
it doesn't have any "partition" system  
**@contact:** You only get one subscription for every consumer  
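For reference on the minion path mentioned above, the *`RealtimeToOfflineSegmentsTask`* is declared in the realtime table config and scheduled by the controller, and as noted in the thread it only works with low-level (partition-based) stream consumers. A minimal sketch, with illustrative period values that are assumptions rather than recommendations:
```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "2d"
    }
  }
}
```
The task regenerates segments for each time bucket using the offline table's own configuration and pushes them to the offline table, which is what keeps the broker's time boundary consistent; that is the problem @npawar warns about when segments are copied over manually.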
 **@mikexia:** @mikexia has joined the channel  
 **@wrbriggs:** I’m running into a situation where Pinot is using a star-tree
index to satisfy a query in one case, but not in another, and the queries are
almost identical. This one does not use the star tree: ```SELECT dimension,
SUM(metric) AS totalMetrics FROM myTable WHERE otherDimension='filterValue'
AND eventTimestamp >= cast(now() - 172800000 as long) GROUP BY 1 ORDER BY 2
DESC LIMIT 10``` This one uses the star tree: ```SELECT dimension, SUM(metric)
AS totalMetrics FROM myTable WHERE otherDimension='filterValue' AND
eventTimestamp >= 1611161288000 GROUP BY 1 ORDER BY 2 DESC LIMIT 10``` It
looks like the use of a dynamically-computed timestamp value is confusing the
optimizer somehow? the `eventTimestamp` column is not part of my star-tree
index in either case.  
**@mayanks:** How did you find out that StarTree was used for one query?  
**@wrbriggs:** Trace  
**@mayanks:** You are right, the code to check if star-tree can be used or not
does not handle the expression `cast(now() - 172800000 as long)`. You can file
an issue for the same.  
**@wrbriggs:** @mayanks I’m new to the Pinot codebase, but if you can point me
in the right direction, I’d be happy to see if I can put together a PR to
address this.  
**@mayanks:** Nice  
**@mayanks:** Let me post a pointer here  
**@mayanks:**  
**@mayanks:** Note: after optimization, the LHS and RHS of the time predicate
are swapped  
**@wrbriggs:** @mayanks Correct me if I'm wrong, but based on a quick look, it
appears that the problem might actually be in `extractPredicateEvaluatorsMap`,
and not in `isFitForStarTree`?  
**@mayanks:** yes  
**@mayanks:** `isFitForStarTree` is the high level entry point for you to get
the algorithm. The code that passes the predicateColumns is the one that needs
to be checked.  
**@wrbriggs:** Got it, perfect, and thank you.  
**@mayanks:** You'll jump a few classes from this entry point to get to the
exact place that needs the fix, but you'll get a better picture of what is
going on  
**@wrbriggs:** Yup, looks like the various Aggregation*PlanNode instances.
I'll spend some time on it after I get done with my day job, thanks again!  
**@mayanks:** They'll probably lead you to
```StarTreeUtils.extractPredicateEvaluatorsMap```  
**@wrbriggs:** Yeah, they did, that was what I referenced above
:slightly_smiling_face:  
**@mayanks:** Oh yeah you did  
**@wrbriggs:** @jackie.jxt Saw you commented on the time column predicate
issue… wanted to tag you here for context.  
**@jackie.jxt:** @wrbriggs I think the problem here is that `1611161288000` is
in micros instead of millis  
**@jackie.jxt:** Then everything is pruned out  
**@mayanks:** ```I did a simple test to compile the query to BrokerRequest,
and I do see cast(now() - 172800000 as long) as the LHS of the predicate. My
test did not actually go through any query execution.```  
**@wrbriggs:** @jackie.jxt 1611161288000 is ms since epoch  
**@jackie.jxt:** Oh, sorry my bad  
**@jackie.jxt:** The problem is not how the predicate is evaluated, but why
the first query is not converted to the second query on the broker side  
**@mayanks:** Because of use of a function (now()) as opposed to a constant
eval? (just guessing)  
**@jackie.jxt:** Seems to work without the cast  
**@wrbriggs:** yes, it seems to be the cast. The only reason the cast is
necessary is being addressed here:  
**@mayanks:** Yeah, I am working with @amrish.k.lal on ^^  
**@wrbriggs:** Yes, I was just thanking him earlier because I noticed that PR
:slightly_smiling_face:  
**@wrbriggs:** I didn’t realize these were so closely intertwined, though  
**@mayanks:** I have concerns with this PR, so we would probably find other
ways to fix the problem this PR is trying to address.  
**@jackie.jxt:** The cast is not invoked on broker side because cast is not
registered as a scalar function  
**@wrbriggs:** So presumably there’s something buried in the
`PinotQuery2BrokerRequestConverter` (or thereabouts) that is reifying eligible
expressions into constants (e.g., `now()`) before handing the query off from
the Broker for execution, and it isn’t handling `cast` function expressions?  
**@wrbriggs:** ah  
**@jackie.jxt:** We can add a scalar function for cast, then it should work on
broker side (compile time)  
 **@pabraham.usa:** Hello, is there any way to delete the table config without
deleting segments, and then recreate the same table with the same segments from
disk, for realtime? I happened to execute a wrong clusterconfig REST call and
broke the broker UI. I tried updating it again but no luck, so I'm planning to
recreate the table config without losing data.  
**@mayanks:** Not that I am aware of (for realtime). What exactly broke in the
broker?  
**@pabraham.usa:** The Cluster Manager section stopped loading. I can see some
js error in the javascript console. All ingestion and searches are working
fine.  
**@mayanks:** Restart controller?  
**@pabraham.usa:** not helping, seems like the bad config is somehow coming
from zookeeper.  
**@mayanks:** Yeah, can't think of an easy fix  
**@pabraham.usa:** will do the hard way.  
**@jackie.jxt:** If you know which zk record is breaking, you may manually fix
it via the zookeeper browser (I assume this is the hard way)  
**@mayanks:** Hard way was delete and recreate table, I think  
**@ssubrama:** For offline segments, you can re-load them from the
Deleted_Segments folder in your PinotFS after recreating the table. For
realtime tables, you can consume from the earliest available row in the stream.
Note that when it starts to consume feverishly, query performance will suffer.
Also, if the data has already been retained out from the underlying stream,
then it is gone forever, sorry.  
**@pabraham.usa:** @jackie.jxt @mayanks correct, delete and recreate
unfortunately.  
**@pabraham.usa:** As it seems it's not straightforward to edit zookeeper data.
@ssubrama it is good to know that I can restore from the Deleted_Segments
folder. Do you have any link related to this? Also, to fetch data from the
stream I may have to somehow specify the offset to start with, right? Where can
I specify that?  
**@pabraham.usa:** @ssubrama it actually caught up with the stream from the
beginning, which is good. However, the restore took some time. Restoring from
Deleted_Segments sounds like the fastest option.  
**@ssubrama:** Note that this restore option is only for offline tables. You
can check the root directory of your PinotFS (under which there are table
directories), and there should be a folder for deleted segments  
**@pabraham.usa:** ohh ok, is it possible to copy segments created by
realtime manually to the Deleted_Segments folder, then create an offline table
from it, and then update the offline table config to realtime?  
**@ssubrama:** Well, the realtime segments are also available in the
Deleted_Segments folder. But they cannot be uploaded into the table in any
easy manner. If it is a production issue ($$$ at stake), you can probably
recover, but it needs a lot of manual work.  
**@pabraham.usa:** Actually my stream retention matches the Pinot retention so
there is no data loss. However, what is the best way to do proper backup and
restore? Is an S3 or EFS based deep store a good way to go?  
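On the deep store question, a minimal sketch of pointing the controller's segment store at S3, assuming the S3 filesystem plugin is on the classpath (bucket, region, and paths are placeholders):
```
# bucket, region, and local temp path below are placeholders for illustration
controller.data.dir=s3://<bucket>/pinot/controller-data
controller.local.temp.dir=/tmp/pinot-controller
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=<region>
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```
With the deep store on S3 (or any durable PinotFS), completed segments are copied off the servers' local disks, which is what makes restore-by-reupload workflows like the one discussed above possible.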
 **@vadlamani1729:** @vadlamani1729 has joined the channel  
 **@elon.azoulay:** Does anyone here impose a limit on
`pinot.broker.query.response.limit`? We are thinking of limiting it to 1k, and
were wondering what other Pinot installations use.  
**@jackie.jxt:** This config is used to prevent super expensive queries from
exhausting the resources on the servers. Based on your use case, you may choose
the value accordingly. If you normally won't run a query with a limit higher
than 1000, you can set it to 1000, and it will bound the limit to 1000  
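For reference, this is a broker-level property rather than part of the table config; a minimal sketch of setting it in the broker configuration file, using the 1k value from the thread:
```
# bounds the LIMIT any query can request through this broker (value mirrors the thread's example)
pinot.broker.query.response.limit=1000
```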

###  _#query-latency_

  
 **@falexvr:** @falexvr has joined the channel  
 **@falexvr:** @falexvr has left the channel  

###  _#pinot-dev_

  
 **@luanmorenomaciel:** @elon.azoulay and @mayanks, any yaml file that you
could share with me regarding schema registry integration?  
**@elon.azoulay:** For the stream configs here is an example:
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "LowLevel",
  "stream.kafka.topic.name": "<my topic>",
  "stream.kafka.broker.list": "<my broker host>:9092",
  "realtime.segment.flush.threshold.time": "6h",
  "realtime.segment.flush.threshold.size": "0",
  "realtime.segment.flush.desired.size": "200M",
  "stream.kafka.consumer.prop.isolation.level": "read_committed",
  "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
  "stream.kafka.consumer.prop.group.id": "<a uuid>",
  "stream.kafka.consumer.prop.client.id": "<another uuid>",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "http://<schema registry>:8081"
}
```
**@elon.azoulay:** We do not use ssl but I believe @mayanks might have more
context about that.  
**@luanmorenomaciel:** that's great @elon.azoulay, I'll try to implement it and
let you know  

###  _#webex_

  
 **@moonbow:** @moonbow has joined the channel  
 **@moonbow:** @moonbow set the channel purpose: Cisco Webex  